Release Notes

Cloud Orchestration - Kubernetes Application Notes 23.7.0

Version

Description

23.7.0

Added support for OpenShift Container Platform 4.13.

Added support for RHEL 9.1 and 9.2 with CRI-O container runtime (Beta).

Added support for NodeFeatureApi in Node Feature Discovery.

23.5.0

Added support for NVIDIA IPAM Plugin deployment.

Added support for CRDs upgrade during NVIDIA Network Operator installation or upgrade.

23.4.0

Added support for Kubernetes >= 1.21 and <=1.27.

Added support for NicClusterPolicy update and removal.

Added support for OpenShift Container Platform 4.11 and 4.12.

23.1.0

  • Added a calendar versioning schema for Network Operator releases to better align with the NVIDIA GPU Operator.

  • Added support for the following operating systems and Kubernetes environments:

    • RHEL 8.4 and 8.6 with CRI-O container runtime

    • Kubernetes >= 1.21 and <=1.26

  • Added PKey configuration for IB networks with IB-Kubernetes.

  • Added the ability to gracefully terminate the OFED container on DGX systems running Red Hat OpenShift.

1.4.0

Added support for Kubernetes >= 1.21 and <=1.25.

Added support for Ubuntu 22.04.

Added support for OpenShift Container Platform 4.11 including DGX platform.

Added Beta support for PKey configuration for IB networks with IB-Kubernetes.

1.3.0

Added support for Kubernetes >= 1.17 and <=1.24.

Added the option to use a single namespace to deploy Network Operator components.

Added support for automatic MLNX OFED driver upgrade.

Added support for IPoIB CNI.

Added support for Air Gap deployment.

1.2.0

Added support for OpenShift Container Platform 4.10.

Added extended selectors support for SR-IOV Device Plugin resources with Helm chart.

Added WhereAbouts IP reconciler support.

Added BlueField2 NICs support for SR-IOV operator.

1.1.0

Added support for OpenShift Container Platform 4.9.

Added support for Network Operator upgrade from v1.0.0.

Added support for Kubernetes Pod Security Policy.

Added support for Kubernetes >= 1.17 and <=1.22.

Added the ability to propagate nodeAffinity property from the NicClusterPolicy to Network Operator dependencies.

1.0.0

Added Node Feature Discovery that can be used to mark nodes with NVIDIA SR-IOV NICs.

Added support for different networking models:

  • Macvlan Network

  • HostDevice Network

  • SR-IOV Network

Added Kubernetes cluster scale-up support.

Published Network Operator image at NGC.

Added support for Kubernetes >= 1.17 and <=1.21.

Upgrade Notes

Version

Notes

23.7.0

  • Dropped MLNX_OFED support for versions older than 5.7-0.1.2.0.

  • Removed nv-peer-mem support in favor of nvidia-peer-mem.
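
The first note above sets a minimum driver version. A minimal sketch of checking a version string against that floor, assuming MLNX_OFED versions always consist of dot- and dash-separated integers (as in 5.7-0.1.2.0 and 23.07-0.5.0.0). The function names are illustrative and are not part of the Network Operator:

```python
from __future__ import annotations

MIN_SUPPORTED = "5.7-0.1.2.0"  # floor stated in the 23.7.0 upgrade note


def parse_ofed_version(version: str) -> tuple[int, ...]:
    """Split a version such as '5.7-0.1.2.0' into a tuple of integers."""
    return tuple(int(part) for part in version.replace("-", ".").split("."))


def is_supported(version: str, minimum: str = MIN_SUPPORTED) -> bool:
    """Return True if `version` is not older than `minimum`."""
    return parse_ofed_version(version) >= parse_ofed_version(minimum)


if __name__ == "__main__":
    # The first candidate is a made-up older version used only for illustration.
    for candidate in ("5.6-1.0.3.3", "5.7-0.1.2.0", "23.07-0.5.0.0"):
        print(candidate, is_supported(candidate))
```

Comparing integer tuples avoids the pitfalls of plain string comparison, where "23.07" would sort before "5.7".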

1.3.0

Manual gradual upgrade is not supported when upgrading to Network Operator v1.3.0. If components are deployed into a single namespace, all pods are dropped and restarted when the old namespace is deleted, which can lead to network connectivity issues during the upgrade procedure.

1.2.0

  • Network Operator 1.2.0 deploys the NVIDIA MLNX_OFED 5.6 driver container by default. Depending on your system kernel and OS configuration, the network device name may change once the container is deployed, because it no longer installs a udev rule to force a network device naming scheme. Instead, the default setting uses the name already configured in the system by either systemd.network or any pre-existing udev rules (e.g., the enp3s0f0 netdev changes to enp3s0f0np0). If this applies to your system, make sure to update the following:

    • The master network device name in your MacvlanNetwork

    • The ifNames selector, if used in RDMA shared device plugin resource configuration

    • The pfNames selector, if used in SR-IOV device plugin configuration

    • The NicSelector.PfNames field in any SriovNetworkNodePolicy instance, if the sriov-network-operator is used

  • When Network Operator 1.2.0 is installed via Helm, it no longer deploys both the RDMA shared device plugin and the SR-IOV network device plugin by default, because doing so may cause the same device to be registered to two different device plugins. Instead, only the RDMA shared device plugin is deployed via Helm by default.

    If you wish to deploy both device plugins, set the `sriovDevicePlugin.deploy` Helm parameter to "true".
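
Helm addresses nested values with dotted keys, so the sriovDevicePlugin.deploy parameter corresponds to a nested structure in a values file. A minimal sketch of that expansion, assuming you want to generate the override programmatically; the helper name is illustrative and is not part of Helm or the Network Operator:

```python
import json


def dotted_to_nested(key: str, value) -> dict:
    """Expand a dotted Helm key such as 'a.b' into {'a': {'b': value}}."""
    nested = value
    for part in reversed(key.split(".")):
        nested = {part: nested}
    return nested


# Override that enables the SR-IOV network device plugin alongside the default one.
overrides = dotted_to_nested("sriovDevicePlugin.deploy", True)

if __name__ == "__main__":
    print(json.dumps(overrides, indent=2))
```

The printed structure is what the equivalent entry in a values file would contain; passing the same key with Helm's --set option on the command line has the same effect.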

1.1.0

N/A

1.0.0

N/A


System Requirements

  • RDMA capable hardware: NVIDIA ConnectX-5 NIC or newer, NVIDIA BlueField-2 DPUs or newer

  • NVIDIA GPU Operator Version 23.6.0 or newer (required for the workloads using NVIDIA GPUs and GPUDirect RDMA technology)

  • Operating Systems:

    • Ubuntu: v22.04, v20.04

    • OpenShift Container Platform (OCP): v4.13, v4.12

    • RHEL: v9.2, v9.1, v8.6, v8.4

  • Container runtime: containerd, CRI-O

Tested Network Adapters

The following network adapters have been tested with the Network Operator:

  • ConnectX-6 Dx

  • ConnectX-7

  • BlueField-2 NIC Mode

Prerequisites

| Component | Version | Notes |
| --- | --- | --- |
| Kubernetes | >=1.24 and <=1.27 | - |
| Helm | v3.5+ | For information and methods of Helm installation, refer to the official Helm website. |
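
A cluster version can be checked against the Kubernetes range above before installation. A minimal sketch, assuming the range applies to the major.minor part of the version so that every 1.27.x patch release counts as supported; the function name is illustrative:

```python
import re

MIN_VERSION = (1, 24)  # lower bound from the prerequisites
MAX_VERSION = (1, 27)  # upper bound from the prerequisites


def is_supported_kubernetes(version: str) -> bool:
    """Return True if a version such as 'v1.26.4' falls within the supported range."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized Kubernetes version: {version!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    for candidate in ("v1.23.9", "v1.24.0", "1.27.3", "v1.28.0"):
        print(candidate, is_supported_kubernetes(candidate))
```

The version string would typically come from the cluster itself, for example from the server version that kubectl reports.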


Component Versions

The following component versions are deployed by the Network Operator:

| Component | Version | Comments |
| --- | --- | --- |
| Node Feature Discovery | v0.13.2 | Optionally deployed. May already be present in the cluster with proper configuration. |
| NVIDIA MLNX_OFED driver container | 23.07-0.5.0.0 | - |
| k8s-rdma-shared-device-plugin | v1.3.2 | - |
| sriov-network-device-plugin | 7e7f979087286ee950bd5ebc89d8bbb6723fc625 | - |
| containernetworking CNI plugins | v1.2.0 | - |
| whereabouts CNI | v0.5.1 | - |
| multus CNI | v3.9.3 | - |
| IPoIB CNI | v1.1.0 | - |
| IB Kubernetes | v1.0.2 | - |
| NV IPAM Plugin | v0.3.0 | - |

Bug Fixes

Version

Description

1.4.0

Fixed a cluster scale-up issue.

Fixed an issue with IPoIB CNI deployment in OCP.

1.3.0

N/A

1.2.0

N/A

1.1.0

Fixed the Whereabouts IPAM plugin to work with Kubernetes v1.22.

Fixed imagePullSecrets for Network Operator.

Enabled resource names for HostDeviceNetwork to be accepted both with and without a prefix.

Known Limitations

Version

Description

All

The MOFED container builds and loads the driver on every MOFED pod startup to support the current OS kernel.

23.4.0

If the UNLOAD_STORAGE_MODULES parameter is enabled for the MOFED container deployment, make sure that the relevant storage modules are not in use in the OS.
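
One way to verify the UNLOAD_STORAGE_MODULES precondition is to inspect the reference counts that Linux reports in /proc/modules, where the third field of each line is the number of current users of the module. A minimal sketch, assuming that format; the module names in the example are placeholders chosen for illustration, since these notes do not list which storage modules are relevant:

```python
from __future__ import annotations


def modules_in_use(proc_modules_text: str, module_names: list[str]) -> list[str]:
    """Return the listed modules whose reference count is greater than zero."""
    wanted = set(module_names)
    busy = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        name, refcount = fields[0], fields[2]
        # The refcount is "-" on kernels built without module unloading support.
        if name in wanted and refcount.isdigit() and int(refcount) > 0:
            busy.append(name)
    return busy


# Sample text in /proc/modules format, used here instead of reading the real file.
SAMPLE = """\
nvme_rdma 40960 2 - Live 0x0000000000000000
ib_isert 53248 0 - Live 0x0000000000000000
mlx5_core 1548288 1 mlx5_ib, Live 0x0000000000000000
"""

if __name__ == "__main__":
    # Placeholder module list; substitute the modules relevant to your deployment.
    print(modules_in_use(SAMPLE, ["nvme_rdma", "ib_isert"]))
```

On a real node, the text would be read from /proc/modules, and an empty result would indicate that none of the listed modules has active users.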

23.1.0

Only a single PKey can be configured per IPoIB workload pod.

1.4.0

The operator upgrade procedure does not reflect configuration changes. The RDMA Shared Device Plugin or SR-IOV Device Plugin should be restarted manually in case of configuration changes.

Within a single cluster, the RDMA subsystem can be either exclusive or shared. Mixed configuration is not supported. The RDMA Shared Device Plugin requires a shared RDMA subsystem.

1.3.0

The MOFED container is not a supported configuration on the DGX platform.

Deleting the MOFED container may unload the driver. In this case, the mlx5_core kernel driver must be reloaded manually. Network connectivity could be affected if there are only NVIDIA NICs on the node.

1.2.0

N/A

1.1.0

NicClusterPolicy update is not supported at the moment.

Network Operator is compatible only with NVIDIA GPU Operator v1.9.0 and above.

GPUDirect performance may degrade on servers that are not optimized for it. See the official GPUDirect documentation for details.

Persistent NICs configuration for netplan or ifupdown scripts is required for SR-IOV and Shared RDMA interfaces on the host.

The Pod Security Policy admission controller should be enabled to use PSP with Network Operator. See Deployment with Pod Security Policy in the Network Operator Documentation for details.

1.0.0

Network Operator is only compatible with NVIDIA GPU Operator v1.5.2 and above.

Persistent NICs configuration for netplan or ifupdown scripts is required for SR-IOV and Shared RDMA interfaces on the host.

© Copyright 2023, NVIDIA. Last updated on Oct 15, 2023.