Network Operator Deployment Guide
The Network Operator Release Notes chapter is available here.
NVIDIA Network Operator leverages Kubernetes CRDs and the Operator SDK to manage networking-related components in order to enable fast networking, RDMA, and GPUDirect for workloads in a Kubernetes cluster. The Network Operator works in conjunction with the GPU Operator to enable GPUDirect RDMA on compatible systems.
The goal of the Network Operator is to manage the networking-related components, while enabling execution of RDMA and GPUDirect RDMA workloads in a Kubernetes cluster. This includes:
- NVIDIA Networking drivers to enable advanced features
- Kubernetes device plugins to provide hardware resources required for a fast network
- Kubernetes secondary network components for network intensive workloads
Network Operator Deployment on Vanilla K8s Cluster
The default installation via Helm, as described below, deploys the Network Operator and related CRDs. An additional step is then required to create a NicClusterPolicy custom resource with the configuration that is desired for the cluster. Please refer to the NicClusterPolicy CRD section for more information on manual custom resource creation.
The provided Helm chart contains various parameters to facilitate the creation of a NicClusterPolicy custom resource upon deployment.
Each Operator release has a set of default version values for the various components it deploys. It is recommended that these values not be changed: testing and validation were performed with these values, and there is no guarantee of interoperability or correctness when different versions are used.
Network Operator Deployment from NGC
To install the operator with chart default values, run:
```
# Download Helm chart
$ helm fetch https://helm.ngc.nvidia.com/nvidia/cloud-native/charts/network-operator-1.4.0.tgz
$ ls network-operator-*.tgz | xargs -n 1 tar xf

# Install Operator
$ helm install -n network-operator --create-namespace network-operator ./network-operator

# View deployed resources
$ kubectl -n network-operator get pods
```
Helm Chart Customization Options
In order to tailor the deployment of the Network Operator to your cluster needs, use the following parameters:
General Parameters
Name | Type | Default | Description |
---|---|---|---|
nfd.enabled | Bool | True | Deploy Node Feature Discovery. |
sriovNetworkOperator.enabled | Bool | False | Deploy SR-IOV Network Operator. |
psp.enabled | Bool | False | Deploy Pod Security Policy. |
operator.repository | String | nvcr.io/nvidia/cloud-native | Network Operator image repository. |
operator.image | String | network-operator | Network Operator image name. |
operator.tag | String | None | Network Operator image tag. If set to None, the chart's appVersion will be used. |
operator.imagePullSecrets | List | [] | An optional list of references to secrets to use for pulling the Network Operator image. |
deployCR | Bool | false | Deploy NicClusterPolicy custom resource according to the provided parameters. |
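As an illustration, the general parameters above map into a values.yaml file as follows. This is a hedged sketch, not a recommendation: the specific values chosen here (NFD on, SR-IOV Network Operator off, deployCR on) are arbitrary examples.

```yaml
# values.yaml (sketch): general chart parameters, example values only.
nfd:
  enabled: true              # deploy Node Feature Discovery
sriovNetworkOperator:
  enabled: false             # skip the SR-IOV Network Operator
psp:
  enabled: false             # no Pod Security Policy
operator:
  repository: nvcr.io/nvidia/cloud-native
  image: network-operator
deployCR: true               # create a NicClusterPolicy from the chart values
```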
NicClusterPolicy Custom Resource Parameters
MLNX_OFED Driver
Name | Type | Default | Description |
---|---|---|---|
ofedDriver.deploy | Bool | false | Deploy the NVIDIA MLNX_OFED driver container |
ofedDriver.repository | String | nvcr.io/nvidia/mellanox | NVIDIA OFED driver image repository |
ofedDriver.image | String | mofed | NVIDIA OFED driver image name |
ofedDriver.version | String | 5.9-0.5.6.0 | NVIDIA OFED driver version |
ofedDriver.env | List | [] | An optional list of environment variables passed to the Mellanox OFED driver image |
ofedDriver.terminationGracePeriodSeconds | Int | 300 | NVIDIA OFED termination grace period in seconds |
ofedDriver.repoConfig.name | String | "" | Private mirror repository configuration configMap name |
ofedDriver.certConfig.name | String | "" | Custom TLS key/certificate configuration configMap name |
ofedDriver.imagePullSecrets | List | [] | An optional list of references to secrets to use for pulling any of the NVIDIA OFED driver images |
ofedDriver.startupProbe.initialDelaySeconds | Int | 10 | NVIDIA OFED startup probe initial delay |
ofedDriver.startupProbe.periodSeconds | Int | 20 | NVIDIA OFED startup probe interval |
ofedDriver.livenessProbe.initialDelaySeconds | Int | 30 | NVIDIA OFED liveness probe initial delay |
ofedDriver.livenessProbe.periodSeconds | Int | 30 | NVIDIA OFED liveness probe interval |
ofedDriver.readinessProbe.initialDelaySeconds | Int | 10 | NVIDIA OFED readiness probe initial delay |
ofedDriver.readinessProbe.periodSeconds | Int | 30 | NVIDIA OFED readiness probe interval |
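For example, the ofedDriver parameters above nest under a single ofedDriver key in values.yaml. The sketch below simply repeats the chart defaults from the table for illustration; only deploy is changed from its default:

```yaml
# values.yaml fragment (sketch): OFED driver section, chart defaults shown.
ofedDriver:
  deploy: true
  repository: nvcr.io/nvidia/mellanox
  image: mofed
  version: 5.9-0.5.6.0
  terminationGracePeriodSeconds: 300
  startupProbe:
    initialDelaySeconds: 10
    periodSeconds: 20
  livenessProbe:
    initialDelaySeconds: 30
    periodSeconds: 30
  readinessProbe:
    initialDelaySeconds: 10
    periodSeconds: 30
```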
NVIDIA Peer Memory Driver
Name | Type | Default | Description |
---|---|---|---|
nvPeerDriver.deploy | Bool | false | Deploy NVIDIA Peer memory driver container |
nvPeerDriver.repository | String | mellanox | NVIDIA Peer memory driver image repository |
nvPeerDriver.image | String | nv-peer-mem-driver | NVIDIA Peer memory driver image name |
nvPeerDriver.version | String | 1.1-0 | NVIDIA Peer memory driver version |
nvPeerDriver.imagePullSecrets | List | [] | An optional list of references to secrets to use for pulling any of the NVIDIA Peer memory driver images |
nvPeerDriver.gpuDriverSourcePath | String | /run/nvidia/driver | GPU driver sources root filesystem path (usually used in tandem with gpu-operator) |
RDMA Shared Device Plugin
Name | Type | Default | Description |
---|---|---|---|
rdmaSharedDevicePlugin.deploy | Bool | true | Deploy RDMA shared device plugin |
rdmaSharedDevicePlugin.repository | String | nvcr.io/nvidia/cloud-native | RDMA shared device plugin image repository |
rdmaSharedDevicePlugin.image | String | k8s-rdma-shared-dev-plugin | RDMA shared device plugin image name |
rdmaSharedDevicePlugin.version | String | v1.3.2 | RDMA shared device plugin version |
rdmaSharedDevicePlugin.imagePullSecrets | List | [] | An optional list of references to secrets to use for pulling any of the RDMA Shared device plugin image |
rdmaSharedDevicePlugin.resources | List | See below | RDMA shared device plugin resources |
RDMA Device Plugin Resource Configurations
Consists of a list of RDMA resources, each with a name and a selector of RDMA capable network devices to be associated with the resource. Refer to RDMA Shared Device Plugin Selectors for supported selectors.
```yaml
resources:
  - name: rdma_shared_device_a
    vendors: [15b3]
    deviceIDs: [1017]
    ifNames: [enp5s0f0]
  - name: rdma_shared_device_b
    vendors: [15b3]
    deviceIDs: [1017]
    ifNames: [enp4s0f0, enp4s0f1]
```
SR-IOV Network Device Plugin
Name | Type | Default | Description |
---|---|---|---|
sriovDevicePlugin.deploy | Bool | false | Deploy SR-IOV Network device plugin |
sriovDevicePlugin.repository | String | ghcr.io/k8snetworkplumbingwg | SR-IOV Network device plugin image repository |
sriovDevicePlugin.image | String | sriov-network-device-plugin | SR-IOV Network device plugin image name |
sriovDevicePlugin.version | String | v3.5.1 | SR-IOV Network device plugin version |
sriovDevicePlugin.imagePullSecrets | List | [] | An optional list of references to secrets to use for pulling any of the SR-IOV Network device plugin image |
sriovDevicePlugin.resources | List | See below | SR-IOV Network device plugin resources |
SR-IOV Network Device Plugin Resource Configuration
Consists of a list of RDMA resources, each with a name and a selector of RDMA capable network devices to be associated with the resource. Refer to SR-IOV Network Device Plugin Selectors for supported selectors.
```yaml
resources:
  - name: hostdev
    vendors: [15b3]
  - name: ethernet_rdma
    vendors: [15b3]
    linkTypes: [ether]
  - name: sriov_rdma
    vendors: [15b3]
    devices: [1018]
    drivers: [mlx5_ib]
```
IB Kubernetes
ib-kubernetes provides a daemon that works in conjunction with the SR-IOV Network Device Plugin. It acts on Kubernetes pod object changes (Create/Update/Delete), reading the pod's network annotation, fetching the corresponding network CRD, and reading the PKey. This is done in order to add the newly generated GUID, or the predefined GUID in the guid field of the CRD cni-args, to that PKey, for pods with the mellanox.infiniband.app annotation.
Name | Type | Default | Description |
---|---|---|---|
ibKubernetes.deploy | bool | false | Deploy IB Kubernetes |
ibKubernetes.repository | string | ghcr.io/mellanox | IB Kubernetes image repository |
ibKubernetes.image | string | ib-kubernetes | IB Kubernetes image name |
ibKubernetes.version | string | v1.0.2 | IB Kubernetes version |
ibKubernetes.imagePullSecrets | list | [] | An optional list of references to secrets to use for pulling any of the IB Kubernetes image |
ibKubernetes.periodicUpdateSeconds | int | 5 | Interval of periodic update in seconds |
ibKubernetes.pKeyGUIDPoolRangeStart | string | 02:00:00:00:00:00:00:00 | Minimal available GUID value to be allocated for the pod |
ibKubernetes.pKeyGUIDPoolRangeEnd | string | 02:FF:FF:FF:FF:FF:FF:FF | Maximal available GUID value to be allocated for the pod |
ibKubernetes.ufmSecret | string | See below | Name of the Secret with the NVIDIA® UFM® access credentials, deployed beforehand |
UFM Secret
IB Kubernetes must access NVIDIA® UFM® in order to manage pods' GUIDs. To provide the credentials, a secret of the following format should be deployed in advance:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ufm-secret
  namespace: kube-system
stringData:
  UFM_USERNAME: "admin"
  UFM_PASSWORD: "123456"
  UFM_ADDRESS: "ufm-hostname"
  UFM_HTTP_SCHEMA: ""
  UFM_PORT: ""
data:
  UFM_CERTIFICATE: ""
```
Note: The InfiniBand fabric manages a single pool of GUIDs. In order to use IB Kubernetes in different clusters, different GUID ranges must be specified to avoid collisions.
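For instance, two clusters sharing one fabric could split the pool at an arbitrary boundary. The ranges below are a hypothetical, non-overlapping split chosen purely for illustration:

```yaml
# Cluster A values.yaml fragment (sketch)
ibKubernetes:
  deploy: true
  pKeyGUIDPoolRangeStart: "02:00:00:00:00:00:00:00"
  pKeyGUIDPoolRangeEnd: "02:00:00:00:FF:FF:FF:FF"
---
# Cluster B values.yaml fragment (sketch)
ibKubernetes:
  deploy: true
  pKeyGUIDPoolRangeStart: "02:00:00:01:00:00:00:00"
  pKeyGUIDPoolRangeEnd: "02:FF:FF:FF:FF:FF:FF:FF"
```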
Secondary Network
Name | Type | Default | Description |
---|---|---|---|
secondaryNetwork.deploy | bool | true | Deploy Secondary Network |
Specifies components to deploy in order to facilitate a secondary network in Kubernetes. It consists of the following optionally deployed components:
- Multus-CNI: Delegate CNI plugin to support secondary networks in Kubernetes
- CNI plugins: Currently only containernetworking-plugins is supported
- IPAM CNI: Currently only the Whereabouts IPAM CNI is supported
- IPoIB CNI: Allow the user to create IPoIB child link and move it to the pod
CNI Plugin
Name | Type | Default | Description |
---|---|---|---|
secondaryNetwork.cniPlugins.deploy | Bool | true | Deploy CNI Plugins Secondary Network |
secondaryNetwork.cniPlugins.image | String | plugins | CNI Plugins image name |
secondaryNetwork.cniPlugins.repository | String | ghcr.io/k8snetworkplumbingwg | CNI Plugins image repository |
secondaryNetwork.cniPlugins.version | String | v0.8.7-amd64 | CNI Plugins image version |
secondaryNetwork.cniPlugins.imagePullSecrets | List | [] | An optional list of references to secrets to use for pulling any of the CNI Plugins images |
Multus CNI
Name | Type | Default | Description |
---|---|---|---|
secondaryNetwork.multus.deploy | Bool | true | Deploy Multus Secondary Network |
secondaryNetwork.multus.image | String | multus-cni | Multus image name |
secondaryNetwork.multus.repository | String | ghcr.io/k8snetworkplumbingwg | Multus image repository |
secondaryNetwork.multus.version | String | v3.8 | Multus image version |
secondaryNetwork.multus.imagePullSecrets | List | [] | An optional list of references to secrets to use for pulling any of the Multus images |
secondaryNetwork.multus.config | String | `` | Multus CNI config. If empty, the config will be automatically generated from the CNI configuration file of the master plugin (the first file in lexicographical order in the cni-conf-dir). |
IPoIB CNI
Name | Type | Default | Description |
---|---|---|---|
secondaryNetwork.ipoib.deploy | Bool | false | Deploy IPoIB CNI |
secondaryNetwork.ipoib.image | String | ipoib-cni | IPoIB CNI image name |
secondaryNetwork.ipoib.repository | String | | IPoIB CNI image repository |
secondaryNetwork.ipoib.version | String | v1.1.0 | IPoIB CNI image version |
secondaryNetwork.ipoib.imagePullSecrets | List | [] | An optional list of references to secrets to use for pulling any of the IPoIB CNI images |
IPAM CNI Plugin
Name | Type | Default | Description |
---|---|---|---|
secondaryNetwork.ipamPlugin.deploy | Bool | true | Deploy IPAM CNI Plugin Secondary Network |
secondaryNetwork.ipamPlugin.image | String | whereabouts | IPAM CNI Plugin image name |
secondaryNetwork.ipamPlugin.repository | String | ghcr.io/k8snetworkplumbingwg | IPAM CNI Plugin image repository |
secondaryNetwork.ipamPlugin.version | String | v0.5.4-amd64 | IPAM CNI Plugin image version |
secondaryNetwork.ipamPlugin.imagePullSecrets | List | [] | An optional list of references to secrets to use for pulling any of the IPAM CNI Plugin image |
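Taken together, the secondary-network tables above correspond to a values.yaml fragment like the following. This is a sketch that repeats the chart defaults from the tables, with IPoIB left disabled:

```yaml
# values.yaml fragment (sketch): secondary network, chart defaults shown.
secondaryNetwork:
  deploy: true
  multus:
    deploy: true
    image: multus-cni
    repository: ghcr.io/k8snetworkplumbingwg
    version: v3.8
  cniPlugins:
    deploy: true
    image: plugins
    repository: ghcr.io/k8snetworkplumbingwg
    version: v0.8.7-amd64
  ipamPlugin:
    deploy: true
    image: whereabouts
    repository: ghcr.io/k8snetworkplumbingwg
    version: v0.5.4-amd64
  ipoib:
    deploy: false
```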
Since several parameters must be provided when creating custom resources during operator deployment, it is recommended to use a configuration file. While it is possible to override parameters via the CLI, we recommend avoiding CLI arguments in favor of a configuration file.
$ helm install -f ./values.yaml -n network-operator --create-namespace --wait network-operator NVIDIA/network-operator
By default, the Network Operator deploys Node Feature Discovery (NFD) in order to perform node labeling in the cluster, which allows proper scheduling of Network Operator resources.
If the nodes have already been labeled by other means, it is possible to disable the deployment of NFD by setting the nfd.enabled=false chart parameter:
$ helm install --set nfd.enabled=false -n network-operator --create-namespace --wait network-operator NVIDIA/network-operator
Currently, the following NFD labels are used:
Label | Location |
---|---|
feature.node.kubernetes.io/pci-15b3.present | Nodes containing NVIDIA Networking hardware |
feature.node.kubernetes.io/pci-10de.present | Nodes containing NVIDIA GPU hardware |
The labels which the Network Operator depends on may change between releases.
Deployment with Pod Security Policy
This section applies to Kubernetes v1.24 or earlier versions only.
A Pod Security Policy is a cluster-level resource that controls security-sensitive aspects of the pod specification. PodSecurityPolicy objects define a set of conditions that a pod must run with in order to be accepted into the system, as well as defaults for the related fields.
By default, the NVIDIA Network Operator does not deploy a Pod Security Policy. To deploy one, override the psp.enabled chart parameter:
$ helm install -n network-operator --create-namespace --wait network-operator NVIDIA/network-operator --set psp.enabled=true
To enforce Pod Security Policies, the PodSecurityPolicy admission controller must be enabled. For instructions, refer to this article in the Kubernetes documentation.
The NVIDIA Network Operator deploys a privileged Pod Security Policy, which provides the operator’s pods the following permissions:
```yaml
privileged: true
hostIPC: false
hostNetwork: true
hostPID: false
allowPrivilegeEscalation: true
readOnlyRootFilesystem: false
allowedHostPaths: []
allowedCapabilities:
  - '*'
fsGroup:
  rule: RunAsAny
runAsUser:
  rule: RunAsAny
seLinux:
  rule: RunAsAny
supplementalGroups:
  rule: RunAsAny
volumes:
  - configMap
  - hostPath
  - secret
  - downwardAPI
```
PodSecurityPolicy is deprecated as of Kubernetes v1.21 and is removed in v1.25.
Network Operator Deployment with Pod Security Admission
The Pod Security admission controller replaces PodSecurityPolicy, enforcing predefined Pod Security Standards by adding a label to a namespace.
The Pod Security Standards define three levels: privileged, baseline, and restricted.
If you want to enforce a PSA on the Network Operator namespace, the privileged level is required. Enforcing the baseline or restricted levels will prevent the creation of the required Network Operator pods.
If required, enforce the privileged PSA level on the Network Operator namespace by running:
$ kubectl label --overwrite ns network-operator pod-security.kubernetes.io/enforce=privileged
If the baseline or restricted levels are enforced on the Network Operator namespace, events for pod creation failures will be triggered:
```
$ kubectl get events -n network-operator --field-selector reason=FailedCreate
LAST SEEN   TYPE      REASON         OBJECT                           MESSAGE
2m36s       Warning   FailedCreate   daemonset/mofed-ubuntu22.04-ds   Error creating: pods "mofed-ubuntu22.04-ds-rwmgs" is forbidden: violates PodSecurity "baseline:latest": host namespaces (hostNetwork=true), hostPath volumes (volumes "run-mlnx-ofed", "etc-network", "host-etc", "host-usr", "host-udev"), privileged (container "mofed-container" must not set securityContext.privileged=true)
```
Network Operator Deployment in Proxy Environment
This section describes how to successfully deploy the Network Operator in clusters behind an HTTP Proxy. By default, the Network Operator requires internet access for the following reasons:
- Container images must be pulled during the Network Operator installation.
- The driver container must download several OS packages prior to the driver installation.
To address these requirements, all Kubernetes nodes, as well as the driver container, must be properly configured in order to direct traffic through the proxy.
This section demonstrates how to configure the Network Operator so that the driver container can successfully download packages behind an HTTP proxy. Since configuring Kubernetes/container runtime components for proxy use is not specific to the Network Operator, those instructions are not detailed here.
If you are not running OpenShift, please skip the section titled HTTP Proxy Configuration for OpenShift, as the OpenShift configuration instructions are different.
Prerequisites
The Kubernetes cluster is configured with HTTP proxy settings (the container runtime should be configured with the HTTP proxy).
HTTP Proxy Configuration for OpenShift
For OpenShift, it is recommended to use the cluster-wide Proxy object to provide proxy information for the cluster. Please follow the procedure described in Configuring the Cluster-wide Proxy in the Red Hat OpenShift public documentation. The Network Operator will automatically inject proxy-related environment variables into the driver container, based on the information present in the cluster-wide Proxy object.
HTTP Proxy Configuration
Specify the ofedDriver.env in your values.yaml file with the appropriate HTTP_PROXY, HTTPS_PROXY, and NO_PROXY environment variables (in both uppercase and lowercase).
```yaml
ofedDriver:
  env:
    - name: HTTPS_PROXY
      value: http://<example.proxy.com:port>
    - name: HTTP_PROXY
      value: http://<example.proxy.com:port>
    - name: NO_PROXY
      value: <example.com>
    - name: https_proxy
      value: http://<example.proxy.com:port>
    - name: http_proxy
      value: http://<example.proxy.com:port>
    - name: no_proxy
      value: <example.com>
```
Network Operator Deployment in Air-gapped Environment
This section describes how to successfully deploy the Network Operator in clusters with restricted internet access. By default, the Network Operator requires internet access for the following reasons:
- The container images must be pulled during the Network Operator installation.
- The OFED driver container must download several OS packages prior to the driver installation.
To address these requirements, it may be necessary to create a local image registry and/or a local package repository, so that the necessary images and packages will be available for your cluster. Subsequent sections of this document detail how to configure the Network Operator to use local image registries and local package repositories. If your cluster is behind a proxy, follow the steps listed in Network Operator Deployment in Proxy Environments.
Local Image Registry
Without internet access, the Network Operator requires all images to be hosted in a local image registry that is accessible to all nodes in the cluster. To allow the Network Operator to work with a local registry, specify the local repository, image, and tag, along with pull secrets, in the values.yaml file.
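A hypothetical sketch: assuming a local registry reachable at registry.local:5000 that mirrors the upstream repository paths, the relevant values.yaml overrides could look like the following. The registry address and secret name are placeholders, not values from this guide:

```yaml
# values.yaml fragment (sketch): point images at a local mirror.
operator:
  repository: registry.local:5000/nvidia/cloud-native   # hypothetical mirror
  imagePullSecrets:
    - local-registry-secret                             # placeholder secret name
ofedDriver:
  deploy: true
  repository: registry.local:5000/nvidia/mellanox       # hypothetical mirror
rdmaSharedDevicePlugin:
  repository: registry.local:5000/nvidia/cloud-native   # hypothetical mirror
```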
Pulling and Pushing Container Images to a Local Registry
To pull the correct images from the NVIDIA registry, you can leverage the repository, image, and version fields specified in the values.yaml file.
Local Package Repository
The OFED driver container deployed as part of the Network Operator requires certain packages to be available as part of the driver installation. In restricted internet access or air-gapped installations, users are required to create a local mirror repository for their OS distribution, and make the following packages available:
```
ubuntu:
  linux-headers-${KERNEL_VERSION}
  linux-modules-${KERNEL_VERSION}

rhcos:
  kernel-headers-${KERNEL_VERSION}
  kernel-devel-${KERNEL_VERSION}
  kernel-core-${KERNEL_VERSION}
  createrepo
  elfutils-libelf-devel
  kernel-rpm-macros
  numactl-libs
```
For Ubuntu, these packages can be found at archive.ubuntu.com, which can serve as the mirror to be replicated locally for your cluster. Using apt-mirror or apt-get download, you can create a full or a partial mirror on your repository server.
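As a sketch of the partial-mirror route, the Ubuntu package names the driver container requests can be derived from the node's kernel version. The version string below is an example value only; on a real node it would come from `uname -r`:

```shell
# Derive the Ubuntu package names required by the OFED driver container
# for one kernel version (example value; use `uname -r` on the node).
KERNEL_VERSION="5.15.0-78-generic"
PKGS="linux-headers-${KERNEL_VERSION} linux-modules-${KERNEL_VERSION}"
echo "${PKGS}"
# These names can then be passed to `apt-get download` on a machine with
# internet access, and the resulting .deb files served from the mirror.
```

This keeps the mirror small compared to replicating the full archive, at the cost of having to refresh it whenever the node kernel changes.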
For RHCOS, dnf reposync can be used to create the local mirror. This requires an active Red Hat subscription for the supported OpenShift version. For example:
dnf --releasever=8.4 reposync --repo rhel-8-for-x86_64-appstream-rpms --download-metadata
Once all the above required packages are mirrored to the local repository, repo lists must be created following the distribution-specific documentation. A ConfigMap containing the repo list file should be created in the namespace where the Network Operator is deployed.
Following is an example of a repo list for Ubuntu 20.04 (access to a local package repository via HTTP):
custom-repo.list:
```
deb [arch=amd64 trusted=yes] http://<local pkg repository>/ubuntu/mirror/archive.ubuntu.com/ubuntu focal main universe
deb [arch=amd64 trusted=yes] http://<local pkg repository>/ubuntu/mirror/archive.ubuntu.com/ubuntu focal-updates main universe
deb [arch=amd64 trusted=yes] http://<local pkg repository>/ubuntu/mirror/archive.ubuntu.com/ubuntu focal-security main universe
```
Following is an example of a repo list for RHCOS (access to a local package repository via HTTP):
cuda.repo (A mirror of https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64):
```
[cuda]
name=cuda
baseurl=http://<local pkg repository>/cuda
priority=0
gpgcheck=0
enabled=1
```
redhat.repo:
```
[baseos]
name=rhel-8-for-x86_64-baseos-rpms
baseurl=http://<local pkg repository>/rhel-8-for-x86_64-baseos-rpms
gpgcheck=0
enabled=1

[baseoseus]
name=rhel-8-for-x86_64-baseos-eus-rpms
baseurl=http://<local pkg repository>/rhel-8-for-x86_64-baseos-eus-rpms
gpgcheck=0
enabled=1

[rhocp]
name=rhocp-4.10-for-rhel-8-x86_64-rpms
baseurl=http://<local pkg repository>/rhocp-4.10-for-rhel-8-x86_64-rpms
gpgcheck=0
enabled=1

[appstream]
name=rhel-8-for-x86_64-appstream-rpms
baseurl=http://<local pkg repository>/rhel-8-for-x86_64-appstream-rpms
gpgcheck=0
enabled=1
```
ubi.repo:
```
[ubi-8-baseos]
name = Red Hat Universal Base Image 8 (RPMs) - BaseOS
baseurl = http://<local pkg repository>/ubi-8-baseos
enabled = 1
gpgcheck = 0

[ubi-8-baseos-source]
name = Red Hat Universal Base Image 8 (Source RPMs) - BaseOS
baseurl = http://<local pkg repository>/ubi-8-baseos-source
enabled = 0
gpgcheck = 0

[ubi-8-appstream]
name = Red Hat Universal Base Image 8 (RPMs) - AppStream
baseurl = http://<local pkg repository>/ubi-8-appstream
enabled = 1
gpgcheck = 0

[ubi-8-appstream-source]
name = Red Hat Universal Base Image 8 (Source RPMs) - AppStream
baseurl = http://<local pkg repository>/ubi-8-appstream-source
enabled = 0
gpgcheck = 0
```
Create the ConfigMap for Ubuntu:
kubectl create configmap repo-config -n <Network Operator Namespace> --from-file=<path-to-repo-list-file>
Create the ConfigMap for RHCOS:
kubectl create configmap repo-config -n <Network Operator Namespace> --from-file=cuda.repo --from-file=redhat.repo --from-file=ubi.repo
Once the ConfigMap is created using the above command, update the values.yaml file with this information to let the Network Operator mount the repo configuration within the driver container and pull the required packages. Based on the OS distribution, the Network Operator will automatically mount this ConfigMap into the appropriate directory.
```yaml
ofedDriver:
  deploy: true
  repoConfig:
    name: repo-config
```
If self-signed certificates are used for an HTTPS-based internal repository, a ConfigMap must be created for those certificates and provided during the Network Operator installation. Based on the OS distribution, the Network Operator will automatically mount this ConfigMap into the appropriate directory.
kubectl create configmap cert-config -n <Network Operator Namespace> --from-file=<path-to-pem-file1> --from-file=<path-to-pem-file2>
```yaml
ofedDriver:
  deploy: true
  certConfig:
    name: cert-config
```
Network Operator Deployment on an OpenShift Container Platform
Cluster-wide Entitlement
Please follow the GPU Operator Guide to enable cluster-wide entitlement.
Node Feature Discovery
To enable Node Feature Discovery please follow the Official Guide.
An example of Node Feature Discovery configuration:
```yaml
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  operand:
    namespace: openshift-nfd
    image: registry.redhat.io/openshift4/ose-node-feature-discovery:v4.10
    imagePullPolicy: Always
  workerConfig:
    configData: |
      sources:
        pci:
          deviceClassWhitelist:
            - "02"
            - "03"
            - "0200"
            - "0207"
          deviceLabelFields:
            - vendor
  customConfig:
    configData: ""
```
Verify that the following label is present on the nodes containing NVIDIA networking hardware:
feature.node.kubernetes.io/pci-15b3.present=true
```
$ oc describe node | egrep 'Roles|pci' | grep -v master
Roles:              worker
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-14e4.present=true
                    feature.node.kubernetes.io/pci-15b3.present=true
Roles:              worker
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-14e4.present=true
                    feature.node.kubernetes.io/pci-15b3.present=true
Roles:              worker
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-14e4.present=true
                    feature.node.kubernetes.io/pci-15b3.present=true
```
SR-IOV Network Operator
If you are planning to use SR-IOV, follow this guide to install SR-IOV Network Operator in OpenShift Container Platform.
Note that the SR-IOV resources created will have the openshift.io prefix.
For the default SriovOperatorConfig CR to work with the MOFED container, update the following values:
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: openshift-sriov-network-operator
spec:
  enableInjector: false
  enableOperatorWebhook: false
  configDaemonNodeSelector:
    node-role.kubernetes.io/worker: ""
    network.nvidia.com/operator.mofed.wait: "false"
```
SR-IOV Network Operator configuration documentation can be found on the Official Website.
GPU Operator
If you plan to use GPUDirect, follow this guide to install GPU Operator in OpenShift Container Platform.
Make sure to enable RDMA and disable useHostMofed in the driver section in the spec of the ClusterPolicy CR.
Network Operator Installation Using an OpenShift Container Platform Console
- In the OpenShift Container Platform web console side menu, select Operators > OperatorHub, and search for the NVIDIA Network Operator.
- Select the NVIDIA Network Operator, and click Install in the first screen and in the subsequent one.
For additional information, see the Red Hat OpenShift Container Platform Documentation.
Network Operator Installation Using CLI
- Create a namespace for the Network Operator.
Create the following Namespace custom resource (CR) that defines the nvidia-network-operator namespace, and then save the YAML in the network-operator-namespace.yaml file:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-network-operator
```
Create the namespace by running the following command:
$ oc create -f network-operator-namespace.yaml
Install the Network Operator in the namespace created in the previous step by creating the below objects. Run the following command to get the channel value required for the next step:
$ oc get packagemanifest nvidia-network-operator -n openshift-marketplace -o jsonpath='{.status.defaultChannel}'
Example Output
stable
Create the following Subscription CR, and save the YAML in the network-operator-sub.yaml file:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nvidia-network-operator
  namespace: nvidia-network-operator
spec:
  channel: "v1.4.0"
  installPlanApproval: Manual
  name: nvidia-network-operator
  source: certified-operators
  sourceNamespace: openshift-marketplace
```
Create the subscription object by running the following command:
$ oc create -f network-operator-sub.yaml
Change to the network-operator project:
$ oc project nvidia-network-operator
Verification
To verify that the operator deployment is successful, run:
$ oc get pods
Example Output:
```
NAME                                                         READY   STATUS    RESTARTS   AGE
nvidia-network-operator-controller-manager-8f8ccf45c-zgfsq   2/2     Running   0          1m
```
A successful deployment shows a Running status.
Network Operator Configuration in an OpenShift Container Platform
In OCP, the 'nvidia-network-operator-resources' namespace must be created manually before creating the NicClusterPolicy CR.
See Deployment Examples for OCP.
Network Operator Upgrade
The network operator provides limited upgrade capabilities, which require additional manual actions if a containerized OFED driver is used. Future releases of the network operator will provide an automatic upgrade flow for the containerized driver.
Since Helm does not support auto-upgrade of existing CRDs, the user must follow a two-step process to upgrade the network-operator release:
- Upgrade the CRD to the latest version
- Apply Helm chart update
Searching for Available Releases
To find available releases, run:
$ helm search repo NVIDIA/network-operator -l
Add the --devel option if you wish to list beta releases as well.
Downloading CRDs for a Specific Release
It is possible to retrieve updated CRDs from the Helm chart or from the release branch on GitHub. The example below shows how to download and unpack a Helm chart for a specified release, and then apply the CRD updates from it.
$ helm pull NVIDIA/network-operator --version <VERSION> --untar --untardir network-operator-chart
The --devel option is required if you wish to use a beta release.
```
$ kubectl apply \
    -f network-operator-chart/network-operator/crds \
    -f network-operator-chart/network-operator/charts/sriov-network-operator/crds
```
Preparing the Helm Values for the New Release
Download the Helm values for the specific release:
Edit the values-<VERSION>.yaml file as required for your cluster. The network operator has some limitations as to which updates in the NicClusterPolicy it can handle automatically. If the configuration for the new release differs from the current configuration of the deployed release, some additional manual actions may be required.
Known limitations:
- If component configuration was removed from the NicClusterPolicy, manual cleanup of the component's resources (DaemonSets, ConfigMaps, etc.) may be required.
- If the configuration for devicePlugin changed without an image upgrade, a manual restart of the devicePlugin may be required.
These limitations will be addressed in future releases.
Changes that were made directly in the NicClusterPolicy CR (e.g., with kubectl edit) will be overwritten by the Helm upgrade.
Temporarily Disabling the Network-operator
This step is required to prevent the old network-operator version from handling the updated NicClusterPolicy CR. This limitation will be removed in future network-operator releases.
$ kubectl scale deployment --replicas=0 -n network-operator network-operator
Please wait for the network-operator pod to be removed before proceeding.
The network-operator will be automatically enabled by the Helm upgrade command. There is no need to enable it manually.
Applying the Helm Chart Update
To apply the Helm chart update, run:
$ helm upgrade -n network-operator network-operator NVIDIA/network-operator --version=<VERSION> -f values-<VERSION>.yaml
The --devel option is required if you wish to use a beta release.
OFED Driver Manual Upgrade
Restarting pods with a Containerized OFED Driver
This operation is required only if containerized OFED is in use.
When a containerized OFED driver is reloaded on the node, all pods that use a secondary network based on NVIDIA NICs will lose their network interfaces in their containers. To prevent an outage, remove all pods that use a secondary network from the node before you reload the driver pod on it.
The Helm upgrade command will only upgrade the DaemonSet spec of the OFED driver to point to the new driver version. The OFED driver's DaemonSet will not automatically restart pods with the driver on the nodes, as it uses "OnDelete" updateStrategy. The old OFED version will still run on the node until you explicitly remove the driver pod or reboot the node:
It is possible to remove all pods with secondary networks from all cluster nodes, and then restart the OFED pods on all nodes at once.
The alternative option is to perform an upgrade in a rolling manner to reduce the impact of the driver upgrade on the cluster. The driver pod restart can be done on each node individually. In this case, pods with secondary networks should be removed from the single node only. There is no need to stop pods on all nodes.
For each node, follow these steps to reload the driver on the node:
- Remove pods with a secondary network from the node.
- Restart the OFED driver pod.
- Return the pods with a secondary network to the node.
When the OFED driver is ready, proceed with the same steps for other nodes.
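The per-node steps above can be sketched as a shell loop over all nodes. This is an illustrative sketch only; `<SELECTOR_FOR_PODS>` and `<DRIVER_NAMESPACE>` are the same placeholders used in the commands below, and the `app=mofed-ubuntu20.04` label matches the Ubuntu 20.04 example:

```shell
# Illustrative rolling reload of the OFED driver, one node at a time.
for NODE in $(kubectl get nodes -o name | cut -d/ -f2); do
  # 1. Remove pods with a secondary network from the node.
  kubectl drain "$NODE" --pod-selector=<SELECTOR_FOR_PODS>
  # 2. Restart the OFED driver pod on that node.
  OFED_POD=$(kubectl get pod -n <DRIVER_NAMESPACE> -l app=mofed-ubuntu20.04 \
    --field-selector spec.nodeName="$NODE" -o name)
  kubectl delete -n <DRIVER_NAMESPACE> "$OFED_POD"
  # Wait for the replacement DaemonSet pod to become Ready.
  NEW_POD=$(kubectl get pod -n <DRIVER_NAMESPACE> -l app=mofed-ubuntu20.04 \
    --field-selector spec.nodeName="$NODE" -o name)
  kubectl wait -n <DRIVER_NAMESPACE> --for=condition=Ready "$NEW_POD" --timeout=600s
  # 3. Return pods with a secondary network to the node.
  kubectl uncordon "$NODE"
done
```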
Removing Pods with a Secondary Network from the Node
To remove pods with a secondary network from the node with node drain, run the following command:
$ kubectl drain <NODE_NAME> --pod-selector=<SELECTOR_FOR_PODS>
Replace <NODE_NAME> with -l "network.nvidia.com/operator.mofed.wait=false"
if you wish to drain all nodes at once.
Restarting the OFED Driver Pod
Find the OFED driver pod name for the node:
$ kubectl get pod -l app=mofed-<OS_NAME> -o wide -A
Example for Ubuntu 20.04:
$ kubectl get pod -l app=mofed-ubuntu20.04 -o wide -A
Deleting the OFED Driver Pod from the Node
To delete the OFED driver pod from the node, run:
$ kubectl delete pod -n <DRIVER_NAMESPACE> <OFED_POD_NAME>
Replace <OFED_POD_NAME>
with -l app=mofed-ubuntu20.04
if you wish to remove OFED pods on all nodes at once.
A new version of the OFED pod will automatically start.
Returning Pods with a Secondary Network to the Node
After the OFED pod is ready on the node, you can make the node schedulable again.
The command below will uncordon (remove node.kubernetes.io/unschedulable:NoSchedule taint) the node, and return the pods to it:
$ kubectl uncordon -l "network.nvidia.com/operator.mofed.wait=false"
Automatic OFED Driver Upgrade
To enable automatic OFED upgrade, define the UpgradePolicy section for the ofedDriver in the NicClusterPolicy spec, and change the OFED version.
nicclusterpolicy.yaml:
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
  namespace: nvidia-network-operator
spec:
  ofedDriver:
    image: mofed
    repository: mellanox
    version: 5.9-0.5.6.0
    upgradePolicy:
      # autoUpgrade is a global switch for the automatic upgrade feature;
      # if set to false, all other options are ignored
      autoUpgrade: true
      # maxParallelUpgrades indicates how many nodes can be upgraded in parallel;
      # 0 means no limit, all nodes will be upgraded in parallel
      maxParallelUpgrades: 1
      # describes the configuration for node drain during automatic upgrade
      drain:
        # allow node draining during upgrade
        enable: true
        # allow force draining
        force: false
        # specify a label selector to filter pods on the node that need to be drained
        podSelector: ""
        # specify the length of time in seconds to wait before giving up on the drain; zero means infinite
        timeoutSeconds: 300
        # specify whether to continue even if there are pods using emptyDir
        deleteEmptyDir: false
Apply NicClusterPolicy CRD:
$ kubectl apply -f nicclusterpolicy.yaml
To be able to drain nodes, please make sure to fulfill PodDisruptionBudget for all the pods that use it.
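For example, a PodDisruptionBudget for a hypothetical `rdma-app` workload (the name, namespace, and label are illustrative, not part of the operator) that keeps at least one replica available during a drain:

```shell
# Illustrative PDB; replace name, namespace, and matchLabels with your workload's.
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: rdma-app-pdb
  namespace: default
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: rdma-app
EOF
```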
Node Upgrade States
The upgrade status of each node is reflected in its nvidia.com/ofed-upgrade-state
annotation. This annotation can have the following values:
Name | Description |
---|---|
Unknown (empty) | This value is set when the upgrade flow is disabled or the node has not been processed yet. |
upgrade-done | This value is set when the OFED pod is up to date and running on the node, and the node is schedulable. |
upgrade-required | This value is set when the OFED pod on the node is not up-to-date and requires upgrade. No actions are performed at this stage. |
drain | This value is set when the node is scheduled for draining. Following the drain, the state is changed either to pod-restart or to drain-failed. |
pod-restart | This value is set when the OFED pod on the node is scheduled for restart. Following the restart, the state is changed to uncordon-required. |
drain-failed | This value is set when the drain on the node has failed. A manual interaction is required at this stage. See the Troubleshooting section for more details. |
uncordon-required | This value is set when the OFED pod on the node is up-to-date and has a "Ready" status. After the uncordon command, the state is changed to upgrade-done. |
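To inspect the current state of all nodes at once, the annotation can be read with jsonpath, for example:

```shell
# Print each node's name and its OFED upgrade state annotation.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.nvidia\.com/ofed-upgrade-state}{"\n"}{end}'
```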
Depending on your cluster workloads and Pod Disruption Budget, set the following values for automatic upgrade:
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
  namespace: nvidia-network-operator
spec:
  ofedDriver:
    image: mofed
    repository: mellanox
    version: 5.9-0.5.6.0
    upgradePolicy:
      autoUpgrade: true
      maxParallelUpgrades: 1
      drain:
        enable: true
        force: false
        deleteEmptyDir: true
Troubleshooting
Issue | Required Action |
---|---|
The node is in the drain-failed state. | Drain the node manually by running kubectl drain <NODE_NAME> --pod-selector=<SELECTOR_FOR_PODS>, as described in "Removing Pods with a Secondary Network from the Node". |
The updated MOFED pod failed to start / a new version of MOFED cannot be installed on the node. | Manually delete the pod by running kubectl delete pod -n <DRIVER_NAMESPACE> <OFED_POD_NAME>, as described in "Deleting the OFED Driver Pod from the Node". |
Ensuring Deployment Readiness
Once the Network Operator is deployed, and a NicClusterPolicy resource is created, the operator will reconcile the state of the cluster until it reaches the desired state, as defined in the resource.
Alignment of the cluster to the defined policy can be verified in the custom resource status.
A "Ready" state indicates that the required components were deployed and that the policy is applied on the cluster.
Example Status Field of a NICClusterPolicy Instance
Status:
  Applied States:
    Name:   state-OFED
    State:  ready
    Name:   state-RDMA-device-plugin
    State:  ready
    Name:   state-NV-Peer
    State:  ignore
    Name:   state-cni-plugins
    State:  ignore
    Name:   state-Multus
    State:  ready
    Name:   state-whereabouts
    State:  ready
  State:    ready
An "Ignore" state indicates that the sub-state was not defined in the custom resource, and thus, it is ignored.
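A quick way to check the aggregated state is to query the status field directly (assuming the custom resource is named `nic-cluster-policy`, as in the examples in this guide):

```shell
# Print the aggregated state of the NicClusterPolicy ("ready" when fully applied).
kubectl get nicclusterpolicies.mellanox.com nic-cluster-policy \
  -o jsonpath='{.status.state}{"\n"}'
```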
Uninstalling the Network Operator
To uninstall the operator, run:
$ helm delete -n network-operator $(helm list -n network-operator | grep network-operator | awk '{print $1}')
$ kubectl -n network-operator delete daemonsets.apps sriov-device-plugin
You should now see all the pods being deleted.
In addition, make sure that the CRDs created during the operator installation have been removed:
$ kubectl get nicclusterpolicies.mellanox.com
No resources found
If the Network Operator was installed with MOFED in containers, the mlx5_core kernel module (for Ethernet NICs) and the ib_ipoib module (for InfiniBand NICs) must be reloaded after MOFED is uninstalled.
Uninstalling the Network Operator on an OpenShift Container Platform
Network Operator Uninstallation Using an OpenShift Container Platform Console
In the OpenShift Container Platform web console side menu, select Operators > Installed Operators, search for the NVIDIA Network Operator and click on it.
On the right side of the Operator Details page, select Uninstall Operator from the Actions drop-down menu.
For additional information, see the Red Hat OpenShift Container Platform Documentation.
Network Operator Uninstallation Using CLI in OpenShift Container Platform
Check the current version of the Network Operator in the currentCSV field:
$ oc get subscription -n nvidia-network-operator nvidia-network-operator -o yaml | grep currentCSV
Example output:
currentCSV: nvidia-network-operator.v1.4.0
Delete the subscription:
$ oc delete subscription -n nvidia-network-operator nvidia-network-operator
Example output:
subscription.operators.coreos.com "nvidia-network-operator" deleted
Delete the CSV using the currentCSV value from previous step:
$ oc delete clusterserviceversion -n nvidia-network-operator nvidia-network-operator.v1.4.0
Example output:
clusterserviceversion.operators.coreos.com "nvidia-network-operator.v1.4.0" deleted
For additional information, see the Red Hat OpenShift Container Platform Documentation.
Additional Steps
Remove CRDs and CRs:
In OCP, uninstalling an operator does not remove its managed resources, including CRDs and CRs.
To remove them, you must manually delete the Operator CRDs following the operator uninstallation.
Run:
$ oc delete crds hostdevicenetworks.mellanox.com macvlannetworks.mellanox.com nicclusterpolicies.mellanox.com
Deployment Examples
Since several parameters should be provided when creating custom resources during operator deployment, it is recommended to use a configuration file. While it is possible to override the parameters via the CLI, doing so is cumbersome and therefore not recommended.
Below are deployment examples, with the values.yaml
file provided to Helm during the installation of the Network Operator. This is achieved by running:
$ helm install -f ./values.yaml -n network-operator --create-namespace --wait NVIDIA/network-operator network-operator
Network Operator Deployment with the RDMA Shared Device Plugin
Network Operator deployment with the default version of the OFED driver and a single RDMA resource mapped to the ens1f0 netdev:
values.yaml
configuration file for such a deployment:
nfd:
  enabled: true
sriovNetworkOperator:
  enabled: false
# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: true
nvPeerDriver:
  deploy: false
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      ifNames: [ens1f0]
sriovDevicePlugin:
  deploy: false
Network Operator Deployment with Multiple Resources in RDMA Shared Device Plugin
Network Operator deployment with the default version of OFED and an RDMA device plugin with two RDMA resources. The first is mapped to ens1f0 and ens1f1, and the second is mapped to ens2f0 and ens2f1.
values.yaml
configuration file for such a deployment:
nfd:
  enabled: true
sriovNetworkOperator:
  enabled: false
# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: true
nvPeerDriver:
  deploy: false
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      ifNames: [ens1f0, ens1f1]
    - name: rdma_shared_device_b
      ifNames: [ens2f0, ens2f1]
sriovDevicePlugin:
  deploy: false
Network Operator Deployment with a Secondary Network
Network Operator deployment with:
- RDMA shared device plugin
- Secondary network
- Multus CNI
- Containernetworking-plugins CNI plugins
- Whereabouts IPAM CNI Plugin
values.yaml
:
nfd:
  enabled: true
sriovNetworkOperator:
  enabled: false
# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: false
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      ifNames: [ens1f0]
secondaryNetwork:
  deploy: true
  multus:
    deploy: true
  cniPlugins:
    deploy: true
  ipamPlugin:
    deploy: true
Network Operator Deployment with a Host Device Network
Network operator deployment with:
- SR-IOV device plugin, single SR-IOV resource pool
- Secondary network
- Multus CNI
- Containernetworking-plugins CNI plugins
- Whereabouts IPAM CNI plugin
In this mode, the Network Operator could be deployed on virtualized deployments as well. It supports both Ethernet and InfiniBand modes. From the Network Operator perspective, there is no difference between the deployment procedures. To work on a VM (virtual machine), the PCI passthrough must be configured for SR-IOV devices. The Network Operator works both with VF (Virtual Function) and PF (Physical Function) inside the VMs.
values.yaml
:
nfd:
  enabled: true
sriovNetworkOperator:
  enabled: false
# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: false
rdmaSharedDevicePlugin:
  deploy: false
sriovDevicePlugin:
  deploy: true
  resources:
    - name: hostdev
      vendors: [15b3]
secondaryNetwork:
  deploy: true
  multus:
    deploy: true
  cniPlugins:
    deploy: true
  ipamPlugin:
    deploy: true
After deployment, the Network Operator should be configured, and K8s networking should be deployed in order to use it in the pod configuration.
The host-device-net.yaml
configuration file for such a deployment:
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: hostdev-net
spec:
  networkNamespace: "default"
  resourceName: "nvidia.com/hostdev"
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.3.225/28",
      "exclude": [
        "192.168.3.229/30",
        "192.168.3.236/32"
      ],
      "log_file": "/var/log/whereabouts.log",
      "log_level": "info"
    }
The host-device-net-ocp.yaml
configuration file for such a deployment in the OpenShift Platform:
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: hostdev-net
spec:
  networkNamespace: "default"
  resourceName: "nvidia.com/hostdev"
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.3.225/28",
      "exclude": [
        "192.168.3.229/30",
        "192.168.3.236/32"
      ]
    }
The pod.yaml
configuration file for such a deployment:
apiVersion: v1
kind: Pod
metadata:
  name: hostdev-test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: hostdev-net
spec:
  restartPolicy: OnFailure
  containers:
    - image: <rdma image>
      name: mofed-test-ctr
      securityContext:
        capabilities:
          add: [ "IPC_LOCK" ]
      resources:
        requests:
          nvidia.com/hostdev: 1
        limits:
          nvidia.com/hostdev: 1
      command:
        - sh
        - -c
        - sleep inf
Network Operator Deployment with an IP over InfiniBand (IPoIB) Network
Network operator deployment with:
- RDMA shared device plugin
- Secondary network
- Multus CNI
- IPoIB CNI
- Whereabouts IPAM CNI plugin
In this mode, the Network Operator could be deployed on virtualized deployments as well. It supports both Ethernet and InfiniBand modes. From the Network Operator perspective, there is no difference between the deployment procedures. To work on a VM (virtual machine), the PCI passthrough must be configured for SR-IOV devices. The Network Operator works both with VF (Virtual Function) and PF (Physical Function) inside the VMs.
values.yaml
:
nfd:
  enabled: true
sriovNetworkOperator:
  enabled: false
# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: true
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      ifNames: [ibs1f0]
secondaryNetwork:
  deploy: true
  multus:
    deploy: true
  ipoib:
    deploy: true
  ipamPlugin:
    deploy: true
Following the deployment, the network operator should be configured, and K8s networking deployed in order to use it in the pod configuration.
The ipoib-net.yaml
configuration file for such a deployment:
apiVersion: mellanox.com/v1alpha1
kind: IPoIBNetwork
metadata:
  name: example-ipoibnetwork
spec:
  networkNamespace: "default"
  master: "ibs1f0"
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.5.225/28",
      "exclude": [
        "192.168.6.229/30",
        "192.168.6.236/32"
      ],
      "log_file": "/var/log/whereabouts.log",
      "log_level": "info",
      "gateway": "192.168.6.1"
    }
The ipoib-net-ocp.yaml
configuration file for such a deployment in the OpenShift Platform:
apiVersion: mellanox.com/v1alpha1
kind: IPoIBNetwork
metadata:
  name: example-ipoibnetwork
spec:
  networkNamespace: "default"
  master: "ibs1f0"
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.5.225/28",
      "exclude": [
        "192.168.6.229/30",
        "192.168.6.236/32"
      ]
    }
The pod.yaml
configuration file for such a deployment:
apiVersion: v1
kind: Pod
metadata:
  name: ipoib-test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: example-ipoibnetwork
spec:
  restartPolicy: OnFailure
  containers:
    - image: <image>
      name: mofed-test-ctr
      securityContext:
        capabilities:
          add: [ "IPC_LOCK" ]
      resources:
        requests:
          rdma/rdma_shared_device_a: 1
        limits:
          rdma/rdma_shared_device_a: 1
      command:
        - sh
        - -c
        - sleep inf
Network Operator Deployment for GPUDirect Workloads
GPUDirect requires the following:
- MOFED v5.5-1.0.3.2 or newer
- GPU Operator v1.9.0 or newer
- NVIDIA GPU and driver supporting GPUDirect, e.g. Quadro RTX 6000/8000 or NVIDIA T4/V100/A100
values.yaml
example:
nfd:
  enabled: true
sriovNetworkOperator:
  enabled: false
# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: true
sriovDevicePlugin:
  deploy: true
  resources:
    - name: hostdev
      vendors: [15b3]
secondaryNetwork:
  deploy: true
  multus:
    deploy: true
  cniPlugins:
    deploy: true
  ipamPlugin:
    deploy: true
host-device-net.yaml
:
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: hostdevice-net
spec:
  networkNamespace: "default"
  resourceName: "hostdev"
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.3.225/28",
      "exclude": [
        "192.168.3.229/30",
        "192.168.3.236/32"
      ],
      "log_file": "/var/log/whereabouts.log",
      "log_level": "info"
    }
host-device-net-ocp.yaml
configuration file for such a deployment in OpenShift Platform:
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: hostdevice-net
spec:
  networkNamespace: "default"
  resourceName: "hostdev"
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.3.225/28",
      "exclude": [
        "192.168.3.229/30",
        "192.168.3.236/32"
      ]
    }
host-net-gpudirect-pod.yaml
:
apiVersion: v1
kind: Pod
metadata:
  name: testpod1
  annotations:
    k8s.v1.cni.cncf.io/networks: hostdevice-net
spec:
  containers:
    - name: appcntr1
      image: <image>
      imagePullPolicy: IfNotPresent
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]
      command:
        - sh
        - -c
        - sleep inf
      resources:
        requests:
          nvidia.com/hostdev: '1'
          nvidia.com/gpu: '1'
        limits:
          nvidia.com/hostdev: '1'
          nvidia.com/gpu: '1'
Network Operator Deployment in SR-IOV Legacy Mode
The SR-IOV Network Operator will be deployed with the default configuration. You can override these settings using a CLI argument, or the ‘sriov-network-operator
’ section in the values.yaml
file. For more information, refer to the Project Documentation.
This deployment mode supports SR-IOV in legacy mode.
values.yaml
configuration file for such a deployment:
nfd:
  enabled: true
sriovNetworkOperator:
  enabled: true
# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: true
rdmaSharedDevicePlugin:
  deploy: false
sriovDevicePlugin:
  deploy: false
secondaryNetwork:
  deploy: true
  multus:
    deploy: true
  cniPlugins:
    deploy: true
  ipamPlugin:
    deploy: true
Following the deployment, the Network Operator should be configured, and sriovnetwork node policy and K8s networking should be deployed.
The sriovnetwork-node-policy.yaml
configuration file for such a deployment:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-1
  namespace: network-operator
spec:
  deviceType: netdevice
  mtu: 1500
  nicSelector:
    vendor: "15b3"
    pfNames: ["ens2f0"]
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  numVfs: 8
  priority: 90
  isRdma: true
  resourceName: sriov_resource
The sriovnetwork.yaml
configuration file for such a deployment:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: "example-sriov-network"
  namespace: network-operator
spec:
  vlan: 0
  networkNamespace: "default"
  resourceName: "sriov_resource"
  ipam: |-
    {
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "192.168.101.0/24"
    }
The ens2f0 network interface name has been chosen from the output of the following command: kubectl -n network-operator get sriovnetworknodestates.sriovnetwork.openshift.io -o yaml.
...
status:
  interfaces:
    - deviceID: 101d
      driver: mlx5_core
      linkSpeed: 100000 Mb/s
      linkType: ETH
      mac: 0c:42:a1:2b:74:ae
      mtu: 1500
      name: ens2f0
      pciAddress: "0000:07:00.0"
      totalvfs: 8
      vendor: 15b3
    - deviceID: 101d
      driver: mlx5_core
      linkType: ETH
      mac: 0c:42:a1:2b:74:af
      mtu: 1500
      name: ens2f1
      pciAddress: "0000:07:00.1"
      totalvfs: 8
      vendor: 15b3
...
Wait for all required pods to be spawned:
# kubectl get pod -n network-operator | grep sriov
network-operator-sriov-network-operator-544c8dbbb9-vzkmc   1/1   Running   0   5d
sriov-device-plugin-vwpzn                                  1/1   Running   0   2d6h
sriov-network-config-daemon-qv467                          3/3   Running   0   5d

# kubectl get pod -n nvidia-network-operator
NAME                         READY   STATUS    RESTARTS   AGE
cni-plugins-ds-kbvnm         1/1     Running   0          5d
cni-plugins-ds-pcllg         1/1     Running   0          5d
kube-multus-ds-5j6ns         1/1     Running   0          5d
kube-multus-ds-mxgvl         1/1     Running   0          5d
mofed-ubuntu20.04-ds-2zzf4   1/1     Running   0          5d
mofed-ubuntu20.04-ds-rfnsw   1/1     Running   0          5d
whereabouts-nw7hn            1/1     Running   0          5d
whereabouts-zvhrv            1/1     Running   0          5d
...
pod.yaml
configuration file for such a deployment:
apiVersion: v1
kind: Pod
metadata:
  name: testpod1
  annotations:
    k8s.v1.cni.cncf.io/networks: example-sriov-network
spec:
  containers:
    - name: appcntr1
      image: <image>
      imagePullPolicy: IfNotPresent
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]
      resources:
        requests:
          nvidia.com/sriov_resource: '1'
        limits:
          nvidia.com/sriov_resource: '1'
      command:
        - sh
        - -c
        - sleep inf
Network Operator Deployment with an SR-IOV InfiniBand Network
Network Operator deployment with InfiniBand network requires the following:
- MOFED and OpenSM running. OpenSM runs on top of the MOFED stack, so both the driver and the subnet manager should come from the same installation. Note that partitions that are configured by OpenSM should specify defmember=full to enable the SR-IOV functionality over InfiniBand. For more details, please refer to this article.
- InfiniBand device – Both host device and switch ports must be enabled in InfiniBand mode.
- rdma-core package should be installed when an inbox driver is used.
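An OpenSM partition with defmember=full might be configured as follows. This is an illustrative sketch only; the PKey value, partition name, and port membership list depend on your fabric, and the file path assumes a default OpenSM installation:

```shell
# Illustrative /etc/opensm/partitions.conf entry; adapt PKey and membership
# to your fabric before restarting OpenSM.
cat <<'EOF' >> /etc/opensm/partitions.conf
Default=0x7fff, ipoib, defmember=full : ALL;
EOF
```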
values.yaml
:
nfd:
  enabled: true
sriovNetworkOperator:
  enabled: true
# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: true
rdmaSharedDevicePlugin:
  deploy: false
sriovDevicePlugin:
  deploy: false
secondaryNetwork:
  deploy: true
  multus:
    deploy: true
  cniPlugins:
    deploy: true
  ipamPlugin:
    deploy: true
sriov-ib-network-node-policy.yaml
:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: infiniband-sriov
  namespace: network-operator
spec:
  deviceType: netdevice
  mtu: 1500
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  nicSelector:
    vendor: "15b3"
  linkType: ib
  isRdma: true
  numVfs: 8
  priority: 90
  resourceName: mlnxnics
sriov-ib-network.yaml
:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovIBNetwork
metadata:
  name: example-sriov-ib-network
  namespace: network-operator
spec:
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.5.225/28",
      "exclude": [
        "192.168.5.229/30",
        "192.168.5.236/32"
      ],
      "log_file": "/var/log/whereabouts.log",
      "log_level": "info"
    }
  resourceName: mlnxnics
  linkState: enable
  networkNamespace: default
sriov-ib-network-pod.yaml
:
apiVersion: v1
kind: Pod
metadata:
  name: test-sriov-ib-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: example-sriov-ib-network
spec:
  containers:
    - name: test-sriov-ib-pod
      image: centos/tools
      imagePullPolicy: IfNotPresent
      command:
        - sh
        - -c
        - sleep inf
      securityContext:
        capabilities:
          add: [ "IPC_LOCK" ]
      resources:
        requests:
          nvidia.com/mlnxnics: "1"
        limits:
          nvidia.com/mlnxnics: "1"
Network Operator Deployment with an SR-IOV InfiniBand Network with PKey Management
Network Operator deployment with InfiniBand network requires the following:
- MOFED and OpenSM running. OpenSM runs on top of the MOFED stack, so both the driver and the subnet manager should come from the same installation. Note that partitions that are configured by OpenSM should specify defmember=full to enable the SR-IOV functionality over InfiniBand. For more details, please refer to this article.
- NVIDIA® UFM® running on top of OpenSM. For more details, please refer to the project's documentation.
- InfiniBand device – Both host device and switch ports must be enabled in InfiniBand mode.
- rdma-core package should be installed when an inbox driver is used.
Current limitations:
- Only a single PKey can be configured per workload pod.
- When a single instance of NVIDIA® UFM® is used with several K8s clusters, different PKey GUID pools should be configured for each cluster.
values.yaml
:
nfd:
  enabled: true
sriovNetworkOperator:
  enabled: true
  resourcePrefix: "nvidia.com"
# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: true
rdmaSharedDevicePlugin:
  deploy: false
sriovDevicePlugin:
  deploy: false
ibKubernetes:
  deploy: true
  periodicUpdateSeconds: 5
  pKeyGUIDPoolRangeStart: "02:00:00:00:00:00:00:00"
  pKeyGUIDPoolRangeEnd: "02:FF:FF:FF:FF:FF:FF:FF"
  ufmSecret: ufm-secret
secondaryNetwork:
  deploy: true
  multus:
    deploy: true
  cniPlugins:
    deploy: true
  ipamPlugin:
    deploy: true
ufm-secret.yaml
:
apiVersion: v1
kind: Secret
metadata:
  name: ib-kubernetes-ufm-secret
  namespace: network-operator
stringData:
  UFM_USERNAME: "admin"
  UFM_PASSWORD: "123456"
  UFM_ADDRESS: "ufm-host"
  UFM_HTTP_SCHEMA: ""
  UFM_PORT: ""
data:
  UFM_CERTIFICATE: ""
Wait for MOFED to install and apply the following CRs:
sriov-ib-network-node-policy.yaml
:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: infiniband-sriov
  namespace: network-operator
spec:
  deviceType: netdevice
  mtu: 1500
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  nicSelector:
    vendor: "15b3"
  linkType: ib
  isRdma: true
  numVfs: 8
  priority: 90
  resourceName: mlnxnics
sriov-ib-network.yaml
:
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ib-sriov-network
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/mlnxnics
spec:
  config: '{
    "type": "ib-sriov",
    "cniVersion": "0.3.1",
    "name": "ib-sriov-network",
    "pkey": "0x6",
    "link_state": "enable",
    "ibKubernetesEnabled": true,
    "ipam": {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "10.56.217.0/24",
      "log_file": "/var/log/whereabouts.log",
      "log_level": "info"
    }
  }'
sriov-ib-network-pod.yaml
:
apiVersion: v1
kind: Pod
metadata:
  name: test-sriov-ib-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: ib-sriov-network
spec:
  containers:
    - name: test-sriov-ib-pod
      image: centos/tools
      imagePullPolicy: IfNotPresent
      command:
        - sh
        - -c
        - sleep inf
      securityContext:
        capabilities:
          add: [ "IPC_LOCK" ]
      resources:
        requests:
          nvidia.com/mlnxnics: "1"
        limits:
          nvidia.com/mlnxnics: "1"
Network Operator Deployment for DPDK Workloads with NicClusterPolicy
This deployment mode supports DPDK applications. In order to run DPDK applications, hugepages should be configured on the required K8s worker nodes. By default, the inbox operating system driver is used. For cases with specific requirements, the OFED container should be deployed.
Network Operator deployment with:
- Host Device Network, DPDK pod
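Hugepages can be reserved on a worker node before the DPDK pod is scheduled; a sketch, assuming 1 GiB pages (the kernel parameters are set via your distro's bootloader configuration, and `<NODE_NAME>` is a placeholder):

```shell
# Reserve two 1 GiB hugepages at boot via kernel parameters (distro-specific):
#   default_hugepagesz=1G hugepagesz=1G hugepages=2
# After a reboot, verify that the kubelet advertises them on the node:
kubectl get node <NODE_NAME> -o jsonpath='{.status.allocatable.hugepages-1Gi}{"\n"}'
```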
nicclusterpolicy.yaml
:
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: mofed
    repository: nvcr.io/nvidia/mellanox
    version: 5.9-0.5.6.0
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: ghcr.io/k8snetworkplumbingwg
    version: a765300344368efbf43f71016e9641c58ec1241b
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "rdma_host_dev",
            "selectors": {
              "vendors": ["15b3"],
              "devices": ["1018"],
              "drivers": ["mlx5_core"]
            }
          }
        ]
      }
  psp:
    enabled: false
  secondaryNetwork:
    cniPlugins:
      image: plugins
      repository: ghcr.io/k8snetworkplumbingwg
      version: v0.8.7-amd64
    ipamPlugin:
      image: whereabouts
      repository: ghcr.io/k8snetworkplumbingwg
      version: v0.4.2-amd64
    multus:
      image: multus-cni
      repository: ghcr.io/k8snetworkplumbingwg
      version: v3.8
host-device-net.yaml
:
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: example-hostdev-net
spec:
  networkNamespace: "default"
  resourceName: "rdma_host_dev"
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.3.225/28",
      "exclude": [
        "192.168.3.229/30",
        "192.168.3.236/32"
      ],
      "log_file": "/var/log/whereabouts.log",
      "log_level": "info"
    }
pod.yaml
:
apiVersion: v1
kind: Pod
metadata:
  name: testpod1
  annotations:
    k8s.v1.cni.cncf.io/networks: example-hostdev-net
spec:
  containers:
    - name: appcntr1
      image: <dpdk image>
      imagePullPolicy: IfNotPresent
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]
      volumeMounts:
        - mountPath: /dev/hugepages
          name: hugepage
      resources:
        requests:
          memory: 1Gi
          hugepages-1Gi: 2Gi
          nvidia.com/rdma_host_dev: '1'
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 300000; done;" ]
  volumes:
    - name: hugepage
      emptyDir:
        medium: HugePages
Deployment Examples For OpenShift Container Platform
In OCP, some components, such as Multus and Whereabouts, are deployed by default, whereas others, such as NFD and the SR-IOV Network Operator, must be deployed manually, as described in the Installation section.
In addition, since there is no use of the Helm chart, the configuration should be done via the NicClusterPolicy CRD.
Following are examples of NicClusterPolicy configuration for OCP.
Network Operator Deployment with a Host Device Network - OCP
Network Operator deployment with:
SR-IOV device plugin, single SR-IOV resource pool:
There is no need for a secondary network configuration, as it is installed by default in OCP.

apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: mofed
    repository: nvcr.io/nvidia/mellanox
    version: 5.9-0.5.6.0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: ghcr.io/k8snetworkplumbingwg
    version: a765300344368efbf43f71016e9641c58ec1241b
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "host_dev",
            "selectors": {
              "vendors": ["15b3"],
              "isRdma": true
            }
          }
        ]
      }
Following the deployment, the Network Operator should be configured, and K8s networking deployed in order to use it in pod configuration. The
host-device-net.yaml
configuration file for such a deployment:

apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: hostdev-net
spec:
  networkNamespace: "default"
  resourceName: "nvidia.com/hostdev"
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.3.225/28",
      "exclude": [
        "192.168.3.229/30",
        "192.168.3.236/32"
      ],
      "log_file": "/var/log/whereabouts.log",
      "log_level": "info"
    }
The
pod.yaml
configuration file for such a deployment:

apiVersion: v1
kind: Pod
metadata:
  name: hostdev-test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: hostdev-net
spec:
  restartPolicy: OnFailure
  containers:
    - image: <rdma image>
      name: mofed-test-ctr
      securityContext:
        capabilities:
          add: [ "IPC_LOCK" ]
      resources:
        requests:
          nvidia.com/hostdev: 1
        limits:
          nvidia.com/hostdev: 1
      command:
        - sh
        - -c
        - sleep inf
Network Operator Deployment with SR-IOV Legacy Mode - OCP
This deployment mode supports SR-IOV in legacy mode.
Note that the SR-IOV Network Operator is required as described in the Deployment for OCP section.
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: mofed
    repository: nvcr.io/nvidia/mellanox
    version: 5.9-0.5.6.0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
An SriovNetworkNodePolicy and K8s networking should be deployed. The sriovnetwork-node-policy.yaml configuration file for such a deployment:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-1
  namespace: network-operator
spec:
  deviceType: netdevice
  mtu: 1500
  nicSelector:
    vendor: "15b3"
    pfNames: ["ens2f0"]
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  numVfs: 5
  priority: 90
  isRdma: true
  resourceName: sriov_network
The sriovnetwork.yaml configuration file for such a deployment:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: "sriov-network"
  namespace: network-operator
spec:
  vlan: 0
  networkNamespace: "default"
  resourceName: "sriov_network"
  ipam: |-
    {
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "192.168.101.0/24"
    }
Note that the resource prefix in this case will be openshift.io.
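The extended resource name a pod requests is simply the resource prefix joined to the resourceName with a slash. A hypothetical helper (not part of any operator API) illustrating how the names in these examples are formed:

```python
# Hypothetical helper showing how an extended resource name is derived:
# <resource prefix>/<resourceName>. On OCP, the SR-IOV Network Operator
# uses the "openshift.io" prefix; device plugin configs can set their
# own prefix, such as "nvidia.com" in the host-device example.
def extended_resource_name(resource_name, prefix="openshift.io"):
    return f"{prefix}/{resource_name}"

print(extended_resource_name("sriov_network"))                  # openshift.io/sriov_network
print(extended_resource_name("hostdev", prefix="nvidia.com"))   # nvidia.com/hostdev
```

The first form is what the pod below requests in its resources section.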
The pod.yaml configuration file for such a deployment:
apiVersion: v1
kind: Pod
metadata:
  name: testpod1
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network
spec:
  containers:
  - name: appcntr1
    image: <image>
    imagePullPolicy: IfNotPresent
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    command:
    - sh
    - -c
    - sleep inf
    resources:
      requests:
        openshift.io/sriov_network: '1'
      limits:
        openshift.io/sriov_network: '1'
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.sriov.capable: "true"
Network Operator Deployment with the RDMA Shared Device Plugin - OCP
The following is an example of RDMA Shared with MacVlanNetwork:
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: mofed
    repository: nvcr.io/nvidia/mellanox
    version: 5.9-0.5.6.0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_88",
            "rdmaHcaMax": 1000,
            "selectors": {
              "vendors": ["15b3"],
              "deviceIDs": ["101d"],
              "drivers": [],
              "ifNames": ["ens1f0", "ens2f0"],
              "linkTypes": []
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/cloud-native
    version: v1.3.2
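The rdmaSharedDevicePlugin selectors above determine which PCI devices are pooled under the rdma_shared_88 resource. A simplified sketch of the matching logic (not the plugin's actual implementation), assuming an empty selector list acts as a wildcard, consistent with the empty "drivers" and "linkTypes" entries in the example:

```python
# Selector values taken from the rdmaSharedDevicePlugin config above.
SELECTORS = {
    "vendors": ["15b3"],
    "deviceIDs": ["101d"],
    "drivers": [],        # empty list = match any driver (assumption)
    "ifNames": ["ens1f0", "ens2f0"],
    "linkTypes": [],      # empty list = match any link type (assumption)
}

def matches(device, selectors):
    """Return True if the device satisfies every non-empty selector."""
    for key, wanted in selectors.items():
        if wanted and device.get(key) not in wanted:
            return False
    return True

# A hypothetical Mellanox port (vendor 15b3, device 101d) for illustration:
dev = {"vendors": "15b3", "deviceIDs": "101d", "drivers": "mlx5_core",
       "ifNames": "ens1f0", "linkTypes": "ether"}
print(matches(dev, SELECTORS))                        # True
print(matches({**dev, "ifNames": "eth0"}, SELECTORS)) # False
```

Note that rdmaHcaMax: 1000 then lets up to 1000 pods share the RDMA devices matched this way on each node.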
The macvlan-net.yaml configuration file for such a deployment:
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdma-shared-88
spec:
  networkNamespace: default
  master: enp4s0f0np0
  mode: bridge
  mtu: 1500
  ipam: '{"type": "whereabouts", "datastore": "kubernetes", "kubernetes": {"kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"}, "range": "16.0.2.0/24", "log_file": "/var/log/whereabouts.log", "log_level": "info", "gateway": "16.0.2.1"}'
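Because the ipam field is a JSON document embedded in a YAML string, a malformed quote or brace only surfaces when the CNI plugin runs. A quick offline check (a convenience sketch, not part of the operator) parses the string and confirms the gateway falls inside the allocation range:

```python
import ipaddress
import json

# The ipam value from the MacvlanNetwork above, as a Python string.
IPAM = ('{"type": "whereabouts", "datastore": "kubernetes", '
        '"kubernetes": {"kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"}, '
        '"range": "16.0.2.0/24", "log_file": "/var/log/whereabouts.log", '
        '"log_level": "info", "gateway": "16.0.2.1"}')

cfg = json.loads(IPAM)  # raises ValueError if the JSON is malformed
net = ipaddress.ip_network(cfg["range"])
gw = ipaddress.ip_address(cfg["gateway"])
print(cfg["type"], gw in net)  # whereabouts True
```

The same check applies to the shorter OCP variant below, which omits the datastore keys.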
The macvlan-net-ocp.yaml configuration file for such a deployment in OpenShift Platform:
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdma-shared-88
spec:
  networkNamespace: default
  master: enp4s0f0np0
  mode: bridge
  mtu: 1500
  ipam: '{"type": "whereabouts", "range": "16.0.2.0/24", "gateway": "16.0.2.1"}'
The pod.yaml configuration file for such a deployment:

apiVersion: v1
kind: Pod
metadata:
  name: test-rdma-shared-1
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-shared-88
spec:
  restartPolicy: OnFailure
  containers:
  - image: myimage
    name: rdma-shared-1
    securityContext:
      capabilities:
        add:
        - IPC_LOCK
    resources:
      limits:
        rdma/rdma_shared_88: 1
      requests:
        rdma/rdma_shared_88: 1
Network Operator Deployment for DPDK Workloads - OCP
To configure huge pages in OpenShift, refer to this guide.
For Network Operator configuration instructions, see here.
NicClusterPolicy CRD
For more information on NicClusterPolicy custom resource, please refer to the Network-Operator Project Sources.
MacVlanNetwork CRD
For more information on MacVlanNetwork custom resource, please refer to the Network-Operator Project Sources.
HostDeviceNetwork CRD
For more information on HostDeviceNetwork custom resource, please refer to the Network-Operator Project Sources.
IPoIBNetwork CRD
For more information on IPoIBNetwork custom resource, please refer to the Network-Operator Project Sources.