
Helm Chart Customization Options

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| operator.admissionController.enabled | Bool | False | Deploy with the admission controller |
| operator.admissionController.useCertManager | Bool | True | Use cert-manager for generating a self-signed certificate |
| operator.admissionController.certificate.tlsCrt | String | "" | External TLS certificate. Ignored if cert-manager is used |
| operator.admissionController.certificate.tlsKey | String | "" | External TLS private key. Ignored if cert-manager is used |
| nfd.enabled | Bool | True | Deploy Node Feature Discovery |
| nfd.deployNodeFeatureRules | Bool | True | Deploy Node Feature Rules to label the nodes |
| sriovNetworkOperator.enabled | Bool | False | Deploy the SR-IOV Network Operator |
| sriovNetworkOperator.configDaemonNodeSelectorExtra | List | node-role.kubernetes.io/worker: "" | Additional values for the SR-IOV Config Daemon node selector |
| upgradeCRDs | Bool | True | Enable CRD upgrades with Helm pre-install and pre-upgrade hooks |
| operator.repository | String | nvcr.io/nvidia | Network Operator image repository |
| operator.image | String | network-operator | Network Operator image name |
| operator.tag | String | None | Network Operator image tag. If set to None, the chart's appVersion is used |
| operator.imagePullSecrets | List | [] | An optional list of references to secrets to use for pulling the Network Operator image |
| operator.cniBinDirectory | String | /opt/cni/bin | Directory where CNI binaries are deployed on the nodes. The equivalent setting for the SR-IOV Network Operator is the sriov-network-operator.cniBinPath parameter. Note that the CNI bin directory must match the CNI bin directory of the container runtime |
| operator.resources | Yaml | See below | Optional resource requests and limits for the operator |
| imagePullSecrets | List | [] | An optional list of references to secrets to use for pulling any of the Network Operator images, if not overridden per component |
| deployCR | Bool | False | Deploy the NicClusterPolicy custom resource according to the provided parameters |
| nodeAffinity | Yaml | See below | Configure node affinity settings for the Network Operator components |
| tolerations | Yaml | "" | Set additional tolerations for the various DaemonSets deployed by the Network Operator, e.g. whereabouts, multus, and cni-plugins |
| useDTK | Bool | True | Enable the use of the Driver ToolKit to compile OFED drivers (OpenShift only) |

The default value of operator.resources:

```yaml
resources:
  limits:
    cpu: 500m
    memory: 128Mi
  requests:
    cpu: 5m
    memory: 64Mi
```

The default value of nodeAffinity:

```yaml
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
    - matchExpressions:
        - key: node-role.kubernetes.io/master
          operator: DoesNotExist
        - key: node-role.kubernetes.io/control-plane
          operator: DoesNotExist
```
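
For instance, a minimal values.yaml sketch that enables the admission controller and adds a toleration for control-plane nodes might look as follows (the values shown are illustrative, not recommendations):

```yaml
# Illustrative values.yaml fragment; keys mirror the table above.
operator:
  admissionController:
    enabled: true
    useCertManager: true
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
```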

ImagePullSecrets customization

To provide imagePullSecrets object references, specify them using the following structure:

```yaml
imagePullSecrets:
  - image-pull-secret1
  - image-pull-secret2
```
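
The names reference Kubernetes secrets of type kubernetes.io/dockerconfigjson that must already exist in the operator namespace. As a sketch, such a secret can be created for nvcr.io like this (the namespace and API-key placeholder are assumptions):

```bash
# Hypothetical example; replace <NGC_API_KEY> with a real NGC API key.
kubectl create secret docker-registry image-pull-secret1 \
  -n nvidia-network-operator \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<NGC_API_KEY>
```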

The NFD labels required by the Network Operator and GPU Operator:

| Label | Location |
| --- | --- |
| feature.node.kubernetes.io/pci-15b3.present | Nodes containing NVIDIA Networking hardware |
| feature.node.kubernetes.io/pci-10de.present | Nodes containing NVIDIA GPU hardware |

SR-IOV Network Operator Helm chart customization options can be found here. The following is a list of values overridden by the NVIDIA Network Operator Helm chart:

| Name | Type | Default in NVIDIA Network Operator | Notes |
| --- | --- | --- | --- |
| sriov-network-operator.operator.resourcePrefix | String | nvidia.com | |
| sriov-network-operator.images.operator | String | nvcr.io/nvidia/mellanox/sriov-network-operator:network-operator-24.4.0 | |
| sriov-network-operator.images.sriovConfigDaemon | String | nvcr.io/nvidia/mellanox/sriov-network-operator-config-daemon:network-operator-24.4.0 | |
| sriov-network-operator.images.sriovCni | String | ghcr.io/k8snetworkplumbingwg/sriov-cni:3e6368077716f6b8368b0e036a1290d1c64cf1fb | For ARM-based deployments, it is recommended to use the ghcr.io/k8snetworkplumbingwg/sriov-cni:v2.8.0-arm64 image |
| sriov-network-operator.images.ibSriovCni | String | ghcr.io/k8snetworkplumbingwg/ib-sriov-cni:fc002af57a81855542759d0f77d16dacd7e1aa38 | For ARM-based deployments, it is recommended to use the ghcr.io/k8snetworkplumbingwg/ib-sriov-cni:1.1.0-arm64 image |
| sriov-network-operator.images.sriovDevicePlugin | String | ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:e6ead1e8f76a407783430ee2666b403db2d76f64 | For ARM-based deployments, it is recommended to use the ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.6.2-arm64 image |
| sriov-network-operator.images.webhook | String | nvcr.io/nvidia/mellanox/sriov-network-operator-webhook:network-operator-24.4.0 | |
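
For example, to use the ARM64 images recommended in the table above, the corresponding values can be overridden in values.yaml:

```yaml
# values.yaml fragment; image references taken from the table above.
sriov-network-operator:
  images:
    sriovCni: ghcr.io/k8snetworkplumbingwg/sriov-cni:v2.8.0-arm64
    ibSriovCni: ghcr.io/k8snetworkplumbingwg/ib-sriov-cni:1.1.0-arm64
    sriovDevicePlugin: ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.6.2-arm64
```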

Optional requests and limits can be configured for each container of the sub-resources deployed by the Network Operator by setting the parameter containerResources.

For example:

```yaml
containerResources:
  - name: "mofed-container"
    requests:
      cpu: "200m"
      memory: "150Mi"
    limits:
      cpu: "300m"
      memory: "300Mi"
```

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| ofedDriver.deploy | Bool | false | Deploy the MLNX_OFED driver container |
| ofedDriver.repository | String | nvcr.io/nvidia/mellanox | MLNX_OFED driver image repository |
| ofedDriver.image | String | doca-driver | MLNX_OFED driver image name |
| ofedDriver.version | String | 24.04-0.6.6.0-0 | MLNX_OFED driver version |
| ofedDriver.initContainer.enable | Bool | true | Deploy the init container |
| ofedDriver.initContainer.repository | String | ghcr.io/mellanox | Init container image repository |
| ofedDriver.initContainer.image | String | network-operator-init-container | Init container image name |
| ofedDriver.initContainer.version | String | v0.0.2 | Init container image version |
| ofedDriver.certConfig.name | String | "" | Custom TLS key/certificate configuration configMap name |
| ofedDriver.repoConfig.name | String | "" | Private mirror repository configuration configMap name |
| ofedDriver.terminationGracePeriodSeconds | Int | 300 | NVIDIA OFED termination grace period in seconds |
| ofedDriver.imagePullSecrets | List | [] | An optional list of references to secrets to use for pulling any of the MLNX_OFED driver images |
| ofedDriver.env | List | [] | An optional list of environment variables passed to the NVIDIA OFED driver image |
| ofedDriver.startupProbe.initialDelaySeconds | Int | 10 | MLNX_OFED startup probe initial delay |
| ofedDriver.startupProbe.periodSeconds | Int | 20 | MLNX_OFED startup probe interval |
| ofedDriver.livenessProbe.initialDelaySeconds | Int | 30 | MLNX_OFED liveness probe initial delay |
| ofedDriver.livenessProbe.periodSeconds | Int | 30 | MLNX_OFED liveness probe interval |
| ofedDriver.readinessProbe.initialDelaySeconds | Int | 10 | MLNX_OFED readiness probe initial delay |
| ofedDriver.readinessProbe.periodSeconds | Int | 30 | MLNX_OFED readiness probe interval |
| ofedDriver.upgradePolicy.autoUpgrade | Bool | true | A global switch for the automatic upgrade feature. If set to false, all other upgrade options are ignored |
| ofedDriver.upgradePolicy.maxParallelUpgrades | Int | 1 | The number of nodes that can be upgraded in parallel. 0 means no limit: all nodes are upgraded in parallel |
| ofedDriver.upgradePolicy.safeLoad | Bool | false | Cordon and drain (if enabled) a node before loading the driver on it. Requires ofedDriver.initContainer to be enabled and ofedDriver.upgradePolicy.autoUpgrade to be true |
| ofedDriver.upgradePolicy.drain.enable | Bool | true | Drain the node (kubectl drain) before the driver reload, if auto upgrade is enabled |
| ofedDriver.upgradePolicy.drain.force | Bool | true | Use force drain of pods |
| ofedDriver.upgradePolicy.drain.podSelector | String | "" | Pod selector to specify which pods will be drained from the node. An empty selector means all pods |
| ofedDriver.upgradePolicy.drain.timeoutSeconds | Int | 300 | Number of seconds to wait for pod eviction |
| ofedDriver.upgradePolicy.drain.deleteEmptyDir | Bool | false | Delete pods' local (emptyDir) storage |
| ofedDriver.upgradePolicy.waitForCompletion.podSelector | String | Not set | A label selector for the pods to wait for completion before starting the driver upgrade |
| ofedDriver.upgradePolicy.waitForCompletion.timeoutSeconds | Int | Not set | The length of time in seconds to wait for workloads to finish before giving up. Zero means infinite |
| ofedDriver.containerResources | List | Not set | Optional resource requests and limits for the mofed-container |
| ofedDriver.forcePrecompiled | Bool | false | Fail the MLNX_OFED deployment if a precompiled OFED driver container image does not exist |
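
For reference, a values.yaml sketch that enables the driver container with a bounded rolling upgrade might look as follows (the values are illustrative and mirror the defaults listed above):

```yaml
# Illustrative values.yaml fragment; keys taken from the table above.
ofedDriver:
  deploy: true
  repository: nvcr.io/nvidia/mellanox
  image: doca-driver
  version: 24.04-0.6.6.0-0
  upgradePolicy:
    autoUpgrade: true
    maxParallelUpgrades: 1   # upgrade one node at a time
    drain:
      enable: true
      force: true
      timeoutSeconds: 300
```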

The following are special environment variables supported by the MLNX_OFED container to configure its behavior:

| Name | Default | Description |
| --- | --- | --- |
| CREATE_IFNAMES_UDEV | "true" for Ubuntu 20.04, RHEL v8.x, and OCP <= v4.13; "false" for newer operating systems | Create a udev rule to preserve "old-style", path-based netdev names, e.g. enp3s0f0 |
| UNLOAD_STORAGE_MODULES | "false" | Unload host storage modules prior to loading the MLNX_OFED modules: ib_isert, nvme_rdma, nvmet_rdma, rpcrdma, xprtrdma, ib_srpt |
| ENABLE_NFSRDMA | "false" | Enable loading of NFS-related storage modules from the MLNX_OFED container |
| RESTORE_DRIVER_ON_POD_TERMINATION | "true" | Restore the host drivers when the driver container pod is terminated |

In addition, it is possible to specify any environment variable to be exposed to the MLNX_OFED container, such as the standard HTTP_PROXY, HTTPS_PROXY, and NO_PROXY.

Warning

CREATE_IFNAMES_UDEV is set automatically by the Network Operator, depending on the operating system of the worker nodes in the cluster (the cluster is assumed to be homogeneous).

To set these variables, specify them as Helm values. For example:

```yaml
ofedDriver:
  env:
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: "true"
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    - name: CREATE_IFNAMES_UDEV
      value: "true"
```

The variables can also be configured directly via the NicClusterPolicy CRD.
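
For instance, a minimal NicClusterPolicy sketch that sets UNLOAD_STORAGE_MODULES on the driver container (the image fields mirror the defaults listed above; mellanox.com/v1alpha1 is the API group used by the operator's CRDs):

```yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 24.04-0.6.6.0-0
    env:
      # Unload host storage modules before loading MLNX_OFED modules.
      - name: UNLOAD_STORAGE_MODULES
        value: "true"
```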

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| rdmaSharedDevicePlugin.deploy | Bool | true | Deploy the RDMA shared device plugin |
| rdmaSharedDevicePlugin.repository | String | nvcr.io/nvidia/cloud-native | RDMA shared device plugin image repository |
| rdmaSharedDevicePlugin.image | String | k8s-rdma-shared-dev-plugin | RDMA shared device plugin image name |
| rdmaSharedDevicePlugin.version | String | 1.4.0 | RDMA shared device plugin version |
| rdmaSharedDevicePlugin.imagePullSecrets | List | [] | An optional list of references to secrets to use for pulling any of the RDMA shared device plugin images |
| rdmaSharedDevicePlugin.resources | List | See below | RDMA shared device plugin resources |
| rdmaSharedDevicePlugin.useCdi | Bool | false | Enable Container Device Interface (CDI) mode. NOTE: the NVIDIA Network Operator does not configure the container runtime to enable CDI |
| rdmaSharedDevicePlugin.containerResources | List | Not set | Optional resource requests and limits for the rdma-shared-dp container |

The rdmaSharedDevicePlugin.resources configuration consists of a list of RDMA resources, each with a name and a selector of RDMA-capable network devices to be associated with the resource. Refer to RDMA Shared Device Plugin Selectors for supported selectors.

```yaml
resources:
  - name: rdma_shared_device_a
    vendors: [15b3]
    deviceIDs: [1017]
    ifNames: [enp5s0f0]
    rdmaHcaMax: 63
  - name: rdma_shared_device_b
    vendors: [15b3]
    deviceIDs: [1017]
    ifNames: [ib0, ib1]
    rdmaHcaMax: 63
```
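
Workloads consume these resources through the standard Kubernetes resource mechanism, by default under the rdma/ prefix. A minimal pod sketch, with the resource name taken from the example above and a placeholder image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-test-pod
spec:
  containers:
    - name: app
      image: my-rdma-app:latest   # placeholder image
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]       # commonly required for RDMA memory registration
      resources:
        limits:
          rdma/rdma_shared_device_a: 1
```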

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| sriovDevicePlugin.deploy | Bool | false | Deploy the SR-IOV Network device plugin |
| sriovDevicePlugin.repository | String | ghcr.io/k8snetworkplumbingwg | SR-IOV Network device plugin image repository |
| sriovDevicePlugin.image | String | sriov-network-device-plugin | SR-IOV Network device plugin image name |
| sriovDevicePlugin.version | String | e6ead1e8f76a407783430ee2666b403db2d76f64 | SR-IOV Network device plugin version. For ARM-based deployments, it is recommended to use the ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.6.2-arm64 image |
| sriovDevicePlugin.imagePullSecrets | List | [] | An optional list of references to secrets to use for pulling any of the SR-IOV Network device plugin images |
| sriovDevicePlugin.resources | List | See below | SR-IOV Network device plugin resources |
| sriovDevicePlugin.useCdi | Bool | false | Enable Container Device Interface (CDI) mode. NOTE: the NVIDIA Network Operator does not configure the container runtime to enable CDI |
| sriovDevicePlugin.containerResources | List | Not set | Optional resource requests and limits for the kube-sriovdp container |

The sriovDevicePlugin.resources configuration consists of a list of resources, each with a name and a selector of RDMA-capable network devices to be associated with the resource. Refer to SR-IOV Network Device Plugin Selectors for supported selectors.

```yaml
resources:
  - name: hostdev
    vendors: [15b3]
  - name: ethernet_rdma
    vendors: [15b3]
    linkTypes: [ether]
  - name: sriov_rdma
    vendors: [15b3]
    devices: [1018]
    drivers: [mlx5_ib]
```
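
Pods typically consume these resources together with a matching network attachment. A minimal sketch, assuming a NetworkAttachmentDefinition named sriov-network exists and the nvidia.com resource prefix shown earlier is in use:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sriov-test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network   # assumed NetworkAttachmentDefinition
spec:
  containers:
    - name: app
      image: my-sriov-app:latest   # placeholder image
      resources:
        limits:
          nvidia.com/sriov_rdma: 1   # resource name from the example above
```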

ib-kubernetes provides a daemon that works in conjunction with the SR-IOV Network Device Plugin. It acts on Kubernetes pod object changes (Create/Update/Delete), reading the pod's network annotation, fetching the corresponding network CRD, and reading its PKey. For pods with the mellanox.infiniband.app annotation, it adds the newly generated GUID, or the predefined GUID from the GUID field of the CRD's cni-args, to that PKey.

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| ibKubernetes.deploy | Bool | false | Deploy IB Kubernetes |
| ibKubernetes.repository | String | ghcr.io/mellanox | IB Kubernetes image repository |
| ibKubernetes.image | String | ib-kubernetes | IB Kubernetes image name |
| ibKubernetes.version | String | v1.0.2 | IB Kubernetes version |
| ibKubernetes.imagePullSecrets | List | [] | An optional list of references to secrets used for pulling any of the IB Kubernetes images |
| ibKubernetes.periodicUpdateSeconds | Int | 5 | Interval of the periodic update in seconds |
| ibKubernetes.pKeyGUIDPoolRangeStart | String | 02:00:00:00:00:00:00:00 | Minimal available GUID value to be allocated for a pod |
| ibKubernetes.pKeyGUIDPoolRangeEnd | String | 02:FF:FF:FF:FF:FF:FF:FF | Maximal available GUID value to be allocated for a pod |
| ibKubernetes.ufmSecret | String | See below | Name of the Secret with the NVIDIA UFM access credentials, deployed in advance |
| ibKubernetes.containerResources | List | Not set | Optional resource requests and limits for the ib-kubernetes container |

IB Kubernetes must access NVIDIA UFM in order to manage pods' GUIDs. To provide the credentials, a secret of the following format should be deployed in advance:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ib-kubernetes-ufm-secret
  namespace: nvidia-network-operator
stringData:
  UFM_USERNAME: "admin"
  UFM_PASSWORD: "123456"
  UFM_ADDRESS: "ufm-hostname"
  UFM_HTTP_SCHEMA: ""
  UFM_PORT: ""
data:
  UFM_CERTIFICATE: ""
```

Warning

The InfiniBand Fabric manages a single pool of GUIDs. In order to use IB Kubernetes in different clusters, different GUID ranges must be specified to avoid collisions.

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| secondaryNetwork.deploy | Bool | true | Deploy the Secondary Network components |

Specifies the components to deploy in order to facilitate a secondary network in Kubernetes; an example that combines them is sketched after the list. It consists of the following optionally deployed components:

  • Multus-CNI: Delegate CNI plugin to support secondary networks in Kubernetes

  • CNI plugins: Currently only containernetworking-plugins is supported

  • IPAM CNI: Currently only the Whereabouts IPAM CNI is supported as a part of the secondaryNetwork section. NVIDIA-IPAM is configured separately.

  • IPoIB CNI: Allows the user to create an IPoIB child link and move it to the pod
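
As an illustration of how these components fit together, the following is a sketch of a NetworkAttachmentDefinition that uses the macvlan plugin (part of the CNI plugins above) with Whereabouts IPAM; the master interface name and address range are assumptions:

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-net
  namespace: default
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens2f0",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.2.0/24"
      }
    }'
```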

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| secondaryNetwork.cniPlugins.deploy | Bool | true | Deploy the CNI Plugins Secondary Network component |
| secondaryNetwork.cniPlugins.image | String | plugins | CNI Plugins image name |
| secondaryNetwork.cniPlugins.repository | String | ghcr.io/k8snetworkplumbingwg | CNI Plugins image repository |
| secondaryNetwork.cniPlugins.version | String | v1.3.0-amd64 | CNI Plugins image version |
| secondaryNetwork.cniPlugins.imagePullSecrets | List | [] | An optional list of references to secrets to use for pulling any of the CNI Plugins images |
| secondaryNetwork.cniPlugins.containerResources | List | Not set | Optional resource requests and limits for the cni-plugins container |

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| secondaryNetwork.multus.deploy | Bool | true | Deploy the Multus Secondary Network component |
| secondaryNetwork.multus.image | String | multus-cni | Multus image name |
| secondaryNetwork.multus.repository | String | ghcr.io/k8snetworkplumbingwg | Multus image repository |
| secondaryNetwork.multus.version | String | v3.9.3 | Multus image version |
| secondaryNetwork.multus.imagePullSecrets | List | [] | An optional list of references to secrets to use for pulling any of the Multus images |
| secondaryNetwork.multus.config | String | "" | Multus CNI config. If empty, the config is automatically generated from the CNI configuration file of the master plugin (the first file in lexicographical order in the cni-conf-dir) |
| secondaryNetwork.multus.containerResources | List | Not set | Optional resource requests and limits for the kube-multus container |

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| secondaryNetwork.ipoib.deploy | Bool | false | Deploy the IPoIB CNI |
| secondaryNetwork.ipoib.image | String | ipoib-cni | IPoIB CNI image name |
| secondaryNetwork.ipoib.repository | String | "" | IPoIB CNI image repository |
| secondaryNetwork.ipoib.version | String | 428715a57c0b633e48ec7620f6e3af6863149ccf | IPoIB CNI image version |
| secondaryNetwork.ipoib.imagePullSecrets | List | [] | An optional list of references to secrets to use for pulling any of the IPoIB CNI images |
| secondaryNetwork.ipoib.containerResources | List | Not set | Optional resource requests and limits for the ipoib-cni container |

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| secondaryNetwork.ipamPlugin.deploy | Bool | true | Deploy the IPAM CNI Plugin Secondary Network component |
| secondaryNetwork.ipamPlugin.image | String | whereabouts | IPAM CNI Plugin image name |
| secondaryNetwork.ipamPlugin.repository | String | ghcr.io/k8snetworkplumbingwg | IPAM CNI Plugin image repository |
| secondaryNetwork.ipamPlugin.version | String | v0.7.0-amd64 | IPAM CNI Plugin image version |
| secondaryNetwork.ipamPlugin.imagePullSecrets | List | [] | An optional list of references to secrets to use for pulling any of the IPAM CNI Plugin images |
| secondaryNetwork.ipamPlugin.containerResources | List | Not set | Optional resource requests and limits for the whereabouts container |

The NVIDIA IPAM Plugin is recommended for large-scale deployments of the NVIDIA Network Operator.

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| nvIpam.deploy | Bool | false | Deploy the NVIDIA IPAM Plugin |
| nvIpam.image | String | nvidia-k8s-ipam | NVIDIA IPAM Plugin image name |
| nvIpam.repository | String | ghcr.io/mellanox | NVIDIA IPAM Plugin image repository |
| nvIpam.version | String | v0.1.2 | NVIDIA IPAM Plugin image version |
| nvIpam.imagePullSecrets | List | [] | An optional list of references to secrets to use for pulling any of the Plugin images |
| nvIpam.enableWebhook | Bool | false | Enable deployment of the validation webhook for the IPPool CRD |
| nvIpam.containerResources | List | Not set | Optional resource requests and limits for the nv-ipam-node and nv-ipam-controller containers |

Warning

A supported X.509 certificate management system must be available in the cluster to enable the validation webhook. Currently, the supported systems are cert-manager and the OpenShift certificate management.
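
For illustration, a minimal IPPool sketch; the nv-ipam.nvidia.com/v1alpha1 API group and field names follow the nvidia-k8s-ipam project's documentation and may differ between plugin versions, and the subnet, block size, and gateway are placeholder values:

```yaml
apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: IPPool
metadata:
  name: pool1
  namespace: nvidia-network-operator
spec:
  subnet: 192.168.0.0/16
  perNodeBlockSize: 100   # number of IPs reserved per node
  gateway: 192.168.0.1
```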

NVIDIA NIC Feature Discovery leverages Node Feature Discovery to advertise NIC-specific labels on K8s Node objects.

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| nicFeatureDiscovery.deploy | Bool | false | Deploy NVIDIA NIC Feature Discovery |
| nicFeatureDiscovery.image | String | nic-feature-discovery | NVIDIA NIC Feature Discovery image name |
| nicFeatureDiscovery.repository | String | ghcr.io/mellanox | NVIDIA NIC Feature Discovery image repository |
| nicFeatureDiscovery.version | String | v0.0.1 | NVIDIA NIC Feature Discovery image version |
| nicFeatureDiscovery.containerResources | List | Not set | Optional resource requests and limits for the nic-feature-discovery container |

DOCA Telemetry Service exports metrics from NVIDIA NICs on K8s Nodes.

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| docaTelemetryService.deploy | Bool | false | Deploy the DOCA Telemetry Service |
| docaTelemetryService.image | String | doca_telemetry | DOCA Telemetry Service image name |
| docaTelemetryService.repository | String | nvcr.io/nvidia/doca | DOCA Telemetry Service image repository |
| docaTelemetryService.version | String | 1.16.5-doca2.6.0-host | DOCA Telemetry Service image version |
| docaTelemetryService.containerResources | List | Not set | Optional resource requests and limits for the doca-telemetry-service container |

Warning

Since several parameters must be provided when creating custom resources during operator deployment, it is recommended to use a configuration file rather than overriding the parameters with CLI arguments.

```bash
$ helm install -f ./values.yaml -n nvidia-network-operator --create-namespace --wait network-operator nvidia/network-operator
```
