Date: 2025-12-14

To obtain new releases, run:

    # Download Helm chart
    $ helm fetch https://helm.ngc.nvidia.com/nvidia/charts/network-operator-25.10.0.tgz
    $ ls network-operator-*.tgz | xargs -n 1 tar xf

Edit the values-<VERSION>.yaml file as required for your cluster.

To apply the Helm chart update, run:

    $ helm upgrade -n nvidia-network-operator network-operator nvidia/network-operator --version=<VERSION> -f values-<VERSION>.yaml --force
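The fetch and upgrade commands above can be parameterized by version. The following is a minimal sketch, not an official script: `chart_url` and `upgrade_cmd` are hypothetical helper names, and the URL pattern is taken from the 25.10.0 example above and only assumed to hold for other releases.

```shell
#!/bin/sh
# Hypothetical helper: build the chart URL for a given version.
# Assumption: other releases follow the same URL pattern as 25.10.0.
chart_url() {
    printf 'https://helm.ngc.nvidia.com/nvidia/charts/network-operator-%s.tgz\n' "$1"
}

# Hypothetical helper: print the upgrade command so it can be reviewed
# before it is run. It mirrors the helm upgrade command shown above.
upgrade_cmd() {
    printf 'helm upgrade -n nvidia-network-operator network-operator nvidia/network-operator --version=%s -f values-%s.yaml --force\n' "$1" "$1"
}
```

For example, `helm fetch "$(chart_url 25.10.0)"` downloads the chart, and `upgrade_cmd 25.10.0` prints the matching upgrade command without executing it.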

Note: The Helm upgrade does not update the component versions in the NicClusterPolicy. Update them manually after the upgrade completes.

Note: The Network Operator can automatically handle only some updates to the NicClusterPolicy. If the configuration for the new release differs from the configuration of the deployed release, additional manual actions may be required.

Known limitations:

If the devicePlugin configuration changed without an image upgrade, a manual restart of the devicePlugin may be required.

These limitations will be addressed in future releases.

Update the component versions in the NicClusterPolicy. Refer to the NicClusterPolicy CRD Full Example for more details and the latest component versions.

To enable automatic DOCA-OFED Driver upgrade, define the upgradePolicy section for the ofedDriver in the NicClusterPolicy spec, and change the DOCA-OFED Driver version.

nicclusterpolicy.yaml:

    apiVersion: mellanox.com/v1alpha1
    kind: NicClusterPolicy
    metadata:
      name: nic-cluster-policy
      namespace: nvidia-network-operator
    spec:
      ofedDriver:
        image: doca-driver
        repository: nvcr.io/nvidia/mellanox
        version: doca3.2.0-25.10-1.2.8.0-2
        upgradePolicy:
          # autoUpgrade is a global switch for automatic upgrade feature
          # if set to false all other options are ignored
          autoUpgrade: true
          # maxParallelUpgrades indicates how many nodes can be upgraded in parallel
          # 0 means no limit, all nodes will be upgraded in parallel
          maxParallelUpgrades: 0
          # cordon and drain (if enabled) a node before loading the driver on it
          safeLoad: false
          # describes the configuration for waiting on job completions
          waitForCompletion:
            # specifies a label selector for the pods to wait for completion
            podSelector: "app=myapp"
            # specify the length of time in seconds to wait before giving up for workload to finish, zero means infinite
            # if not specified, the default is 300 seconds
            timeoutSeconds: 300
          # describes configuration for node drain during automatic upgrade
          drain:
            # allow node draining during upgrade
            enable: true
            # allow force draining
            force: false
            # specify a label selector to filter pods on the node that need to be drained
            podSelector: ""
            # specify the length of time in seconds to wait before giving up drain, zero means infinite
            # if not specified, the default is 300 seconds
            timeoutSeconds: 300
            # specify if should continue even if there are pods using emptyDir
            deleteEmptyDir: false

Apply the NicClusterPolicy CR:

    $ kubectl apply -f nicclusterpolicy.yaml

Warning: To be able to drain nodes, make sure to fill the PodDisruptionBudget field for all the pods that use it. On some clusters (e.g. OpenShift), many pods use a PodDisruptionBudget, which makes draining multiple nodes at once impossible. Evicting several pods that are controlled by the same Deployment or ReplicaSet violates their PodDisruptionBudget, so those pods are not evicted and the drain fails.

To perform a driver upgrade, the Network Operator must evict pods that are using network resources. To ensure that the Network Operator evicts only the required pods, configure the upgradePolicy.drain.podSelector field.

The upgrade status of each node is reflected in its nvidia.com/ofed-driver-upgrade-state label. This label can have the following values:

Name: Description
Unknown (empty): The node has this state when the upgrade flow is disabled or the node has not been processed yet.
upgrade-done: Set when the DOCA-OFED Driver pod is up to date and running on the node, and the node is schedulable.
upgrade-required: Set when the DOCA-OFED Driver pod on the node is not up to date and requires an upgrade. No actions are performed at this stage.
node-maintenance-required: Set after the upgrade-required state when the requestor upgrade mode is used (e.g. MAINTENANCE_OPERATOR_ENABLED=true). A matching nodeMaintenance object is created for the dedicated node(s), and the maintenance operator performs the node operations.
cordon-required: Set when the node needs to be made unschedulable in preparation for the driver upgrade.
wait-for-jobs-required: Set when the node must wait for jobs to complete, up to the given timeout.
drain-required: Set when the node is scheduled for drain. After the drain, the state changes to either pod-restart-required or upgrade-failed.
pod-restart-required: Set when the DOCA-OFED Driver pod on the node is scheduled for restart. After the restart, the state changes to uncordon-required.
uncordon-required: Set when the DOCA-OFED Driver pod on the node is up to date and has "Ready" status. After the node is uncordoned, the state changes to upgrade-done.
upgrade-failed: Set when the upgrade on the node has failed. Manual intervention is required at this stage. See the Troubleshooting section for more details.
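The state label described above can be inspected with standard kubectl options. The following is a minimal sketch, assuming kubectl is configured for the cluster; the function names `show_upgrade_states` and `count_upgrade_states` are hypothetical helpers, not part of the Network Operator.

```shell
#!/bin/sh
# Show every node with its driver upgrade state as an extra column.
show_upgrade_states() {
    kubectl get nodes -L nvidia.com/ofed-driver-upgrade-state
}

# Print "<state> <node count>" for each state found in the cluster.
# Nodes without the label (kubectl prints "<none>") are counted as "unknown".
count_upgrade_states() {
    kubectl get nodes --no-headers \
        -o custom-columns='NAME:.metadata.name,STATE:.metadata.labels.nvidia\.com/ofed-driver-upgrade-state' |
        awk '{ s = ($2 == "" || $2 == "<none>") ? "unknown" : $2; n[s]++ }
             END { for (s in n) print s, n[s] }' |
        sort
}
```

Running `count_upgrade_states` periodically during an upgrade gives a quick view of how many nodes are still in progress and whether any have reached upgrade-failed.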

Warning: Depending on your cluster workloads and PodDisruptionBudget, set the following values for auto upgrade:

    apiVersion: mellanox.com/v1alpha1
    kind: NicClusterPolicy
    metadata:
      name: nic-cluster-policy
      namespace: nvidia-network-operator
    spec:
      ofedDriver:
        image: doca-driver
        repository: nvcr.io/nvidia/mellanox
        version: doca3.2.0-25.10-1.2.8.0-2
        upgradePolicy:
          autoUpgrade: true
          maxParallelUpgrades: 1
          drain:
            enable: true
            force: false
            deleteEmptyDir: true
            podSelector: ""

DOCA-OFED Driver upgrade supports the following modes:

Mode: Description
In-place: The in-place (legacy) mode incorporates the full driver upgrade lifecycle, including node operations such as cordon, pod eviction, drain, and uncordon. It also maintains an internal scheduler that performs these node operations according to the maxParallelUpgrades value under upgradePolicy.
Requestor: The new requestor upgrade mode uses nodeMaintenance Kubernetes API objects of the NVIDIA maintenance operator (refer to the maintenance-operator repo) to initiate the DOCA-OFED Driver upgrade process. In this mode, the current upgrade controller (in-place mode) no longer performs the following node operations: cordon, wait for pod completion, drain, uncordon. To enable requestor mode, set the environment variable MAINTENANCE_OPERATOR_ENABLED=true.

Note: Enabling requestor mode requires the NVIDIA maintenance operator to be deployed on the cluster. By default, the upgrade controller uses in-place mode. nodeMaintenanceNamePrefix is used to distinguish between different requestors (operators) that request node maintenance operations on the same node(s).

Deploying the maintenance operator, enabling requestor mode, and setting the requestor environment variables MAINTENANCE_OPERATOR_REQUESTOR_ID, MAINTENANCE_OPERATOR_REQUESTOR_NAMESPACE, and MAINTENANCE_OPERATOR_NODE_MAINTENANCE_PREFIX can all be done through the Network Operator Helm values.yaml:

    maintenanceOperator:
      enabled: true
    maintenance-operator-chart:
      operatorConfig:
        maxParallelOperations: 2
        maxUnavailable: 2
    operator:
      maintenanceOperator:
        useRequestor: true
        requestorID: "nvidia.network.operator"
        nodeMaintenanceNamePrefix: "network-operator"
        nodeMaintenanceNamespace: default

Warning: The safe driver loading feature is enabled or disabled with the ofedDriver.upgradePolicy.safeLoad option.

Upon node startup, the DOCA-OFED Driver container takes some time to compile and load the driver. During that time, workloads might get scheduled on that node. When DOCA-OFED Driver is loaded, all existing PODs that use NVIDIA NICs will lose their network interfaces. Some such PODs might silently fail or hang. To avoid this situation, before the DOCA-OFED Driver container is loaded, the node should get cordoned and drained to ensure all workloads are rescheduled. The node should be un-cordoned when the driver is ready on it.

The safe driver loading feature is implemented as a part of the upgrade flow, meaning safe driver loading is a special scenario of the upgrade procedure, where we upgrade from the inbox driver to the containerized DOCA-OFED Driver.

When this feature is enabled, the initial DOCA-OFED Driver rollout on a large cluster can take a while. To speed up the rollout, the initial deployment can be done with the safe driver loading feature disabled, and the feature can be enabled later by updating the NicClusterPolicy CR.
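Enabling safe driver loading after the initial rollout amounts to changing one field in the deployed NicClusterPolicy. The following is a minimal sketch using a JSON merge patch; the helper names `safe_load_patch` and `set_safe_load` are hypothetical, and the resource name nic-cluster-policy is taken from the example in this document and may differ on your cluster.

```shell
#!/bin/sh
# Print a JSON merge patch that sets ofedDriver.upgradePolicy.safeLoad.
# Accepts only "true" or "false".
safe_load_patch() {
    case "$1" in
        true|false) ;;
        *) echo "usage: safe_load_patch true|false" >&2; return 1 ;;
    esac
    printf '{"spec":{"ofedDriver":{"upgradePolicy":{"safeLoad":%s}}}}\n' "$1"
}

# Apply the patch to the NicClusterPolicy.
# Assumption: the resource is named nic-cluster-policy, as in the example above.
set_safe_load() {
    patch=$(safe_load_patch "$1") || return 1
    kubectl patch nicclusterpolicy nic-cluster-policy --type merge -p "$patch"
}
```

Calling `set_safe_load true` after the initial deployment has finished turns the feature on. Editing nicclusterpolicy.yaml and re-applying it with kubectl apply achieves the same result and keeps the file as the source of truth.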

Issue: The node is in upgrade-failed state.
Required Action:
1. Drain the node manually by running kubectl drain <node name> --ignore-daemonsets.
2. Delete the NVIDIA DOCA-OFED Driver pod on the node manually by running the following command: kubectl delete pod -n `kubectl get pods -A --field-selector spec.nodeName=<node name> -l nvidia.com/ofed-driver --no-headers | awk '{print $1 " "$2}'`
   NOTE: If the "Safe driver loading" feature is enabled, you may also need to remove the nvidia.com/ofed-driver-upgrade.driver-wait-for-safe-load annotation from the node object to unblock the loading of the driver: kubectl annotate node <node_name> nvidia.com/ofed-driver-upgrade.driver-wait-for-safe-load-
3. Wait for the node to complete the upgrade.

Issue: The updated NVIDIA DOCA-OFED Driver pod failed to start, or a new version of the NVIDIA DOCA-OFED Driver cannot be installed on the node.
Required Action: Manually delete the pod by running kubectl delete pod -n <Network Operator Namespace> <pod name>. If the pod still fails after the restart, change the NVIDIA DOCA-OFED Driver version in the NicClusterPolicy to the previous version or to another working version.
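Before applying the manual actions above, it helps to identify the affected nodes and their driver pods. The following is a minimal sketch built from the label and selectors used in this document; the function names `failed_nodes` and `driver_pod_on_node` are hypothetical helpers.

```shell
#!/bin/sh
# List nodes whose upgrade state label is upgrade-failed.
# Output has the form node/<node name>, one per line.
failed_nodes() {
    kubectl get nodes -l nvidia.com/ofed-driver-upgrade-state=upgrade-failed -o name
}

# Print "<namespace> <pod name>" for the driver pod running on the given node.
# Uses the same label and field selector as the delete command above.
driver_pod_on_node() {
    kubectl get pods -A --field-selector "spec.nodeName=$1" \
        -l nvidia.com/ofed-driver --no-headers |
        awk '{ print $1, $2 }'
}
```

The output of `driver_pod_on_node <node name>` can be checked by eye before passing the namespace and pod name to kubectl delete pod.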

Automatic DOCA-OFED Driver upgrade is the preferred method for upgrading the DOCA-OFED Driver. However, if you need to manually upgrade the DOCA-OFED Driver, you can follow the steps below.

Warning: This operation is required only if the containerized DOCA-OFED Driver is in use.

When a containerized DOCA-OFED Driver is reloaded on the node, all pods that use a secondary network based on NVIDIA NICs will lose network interface in their containers. To prevent outage, remove all pods that use a secondary network from the node before you reload the driver pod on it.

The Helm upgrade command will only upgrade the DaemonSet spec of the DOCA-OFED Driver to point to the new driver version. The DOCA-OFED Driver’s DaemonSet will not automatically restart pods with the driver on the nodes, as it uses “OnDelete” updateStrategy. The old DOCA-OFED Driver version will still run on the node until you explicitly remove the driver pod or reboot the node:

    $ kubectl delete pod -l app=mofed-<OS_NAME> -n nvidia-network-operator

It is possible to remove all pods with secondary networks from all cluster nodes, and then restart the DOCA-OFED Driver pods on all nodes at once.

The alternative option is to perform an upgrade in a rolling manner to reduce the impact of the driver upgrade on the cluster. The driver pod restart can be done on each node individually. In this case, pods with secondary networks should be removed from the single node only. There is no need to stop pods on all nodes.

For each node, follow these steps to reload the driver on the node:

1. Remove pods with a secondary network from the node.
2. Restart the DOCA-OFED Driver pod.
3. Return the pods with a secondary network to the node.

When the DOCA-OFED Driver is ready, proceed with the same steps for other nodes.

To remove pods with a secondary network from the node with node drain, run the following command:

    $ kubectl drain <NODE_NAME> --pod-selector=<SELECTOR_FOR_PODS>

Warning: Replace <NODE_NAME> with -l "network.nvidia.com/operator.mofed.wait=false" if you wish to drain all nodes at once.

Find the DOCA-OFED Driver pod name for the node:

    $ kubectl get pod -l app=mofed-<OS_NAME> -o wide -A

Example for Ubuntu 20.04:

    $ kubectl get pod -l app=mofed-ubuntu20.04 -o wide -A

To delete the DOCA-OFED Driver pod from the node, run:

    $ kubectl delete pod -n <DRIVER_NAMESPACE> <DOCA_DRIVER_POD_NAME>

Warning: Replace <DOCA_DRIVER_POD_NAME> with -l app=mofed-ubuntu20.04 if you wish to remove DOCA-OFED Driver pods on all nodes at once.

A new version of the DOCA-OFED Driver pod will automatically start.

After the DOCA-OFED Driver pod is ready on the node, you can make the node schedulable again.

The command below uncordons the node (removes the node.kubernetes.io/unschedulable:NoSchedule taint) so that pods can be scheduled on it again:

    $ kubectl uncordon -l "network.nvidia.com/operator.mofed.wait=false"
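The per-node steps above (drain, delete the driver pod, wait for the new pod, uncordon) can be combined for a rolling upgrade. The following is a minimal sketch, not an official procedure: `reload_driver_on_node` is a hypothetical helper, the namespace nvidia-network-operator is taken from the examples in this document, and the 30-minute timeout and five retries are arbitrary assumptions to adjust for your cluster.

```shell
#!/bin/sh
# Reload the driver on a single node.
# Arguments: <node name> <workload pod selector> <driver pod label>
# Example driver pod label: app=mofed-ubuntu20.04
reload_driver_on_node() {
    node=$1 workload_selector=$2 driver_label=$3
    ns=nvidia-network-operator   # assumption: namespace used in this document

    # Step 1: remove pods with a secondary network from the node.
    kubectl drain "$node" --pod-selector="$workload_selector" || return 1

    # Step 2: delete the old driver pod; the DaemonSet creates a new one.
    kubectl delete pod -n "$ns" -l "$driver_label" \
        --field-selector "spec.nodeName=$node" || return 1

    # The replacement pod may take a moment to appear, so retry the wait
    # a few times before giving up (retry count and timeout are assumptions).
    tries=0
    until kubectl wait pod -n "$ns" -l "$driver_label" \
            --field-selector "spec.nodeName=$node" \
            --for=condition=Ready --timeout=30m; do
        tries=$((tries + 1))
        if [ "$tries" -ge 5 ]; then
            return 1
        fi
        sleep 10
    done

    # Step 3: make the node schedulable again.
    kubectl uncordon "$node"
}
```

A rolling upgrade then calls the function once per node, for example `reload_driver_on_node worker-1 app=myapp app=mofed-ubuntu20.04`, and moves on to the next node only when the call succeeds.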

See instructions in the Network Operator Upgrade section.