The NVIDIA Network Operator leverages Kubernetes CRDs and the Operator SDK to manage networking-related components in order to enable fast networking, RDMA and GPUDirect for workloads in a Kubernetes cluster. The Network Operator works in conjunction with the GPU Operator to enable GPUDirect RDMA on compatible systems.

The goal of the Network Operator is to manage the networking-related components, while enabling execution of RDMA and GPUDirect RDMA workloads in a Kubernetes cluster. This includes:

  • NVIDIA Networking drivers to enable advanced features
  • Kubernetes device plugins to provide hardware resources required for a fast network
  • Kubernetes secondary network components for network intensive workloads

Network Operator Release Notes

New Features

Version 1.1.0

  • Added support for OpenShift Container Platform 4.9.
  • Added support for Network Operator upgrade from v1.0.0.
  • Added support for Kubernetes Pod Security Policy.
  • Added support for Kubernetes >=1.17 and <=1.22.
  • Added the ability to propagate the nodeAffinity property from the NicClusterPolicy to Network Operator dependencies.

Version 1.0.0

  • Added Node Feature Discovery that can be used to mark nodes with NVIDIA SR-IOV NICs.
  • Added support for different networking models:
    • Macvlan Network
    • HostDevice Network
    • SR-IOV Network
  • Added Kubernetes cluster scale-up support.
  • Published the Network Operator image at NGC.
  • Added support for Kubernetes >=1.17 and <=1.21.

Bug Fixes

Version 1.1.0

  • Fixed the Whereabouts IPAM plugin to work with Kubernetes v1.22.
  • Fixed imagePullSecrets for the Network Operator.
  • Enabled resource names for HostDeviceNetwork to be accepted both with and without a prefix.

Known Limitations 

Version 1.1.0

  • NicClusterPolicy update is not supported at the moment.
  • The Network Operator is compatible only with NVIDIA GPU Operator v1.9.0 and above.
  • GPUDirect may suffer performance degradation when used with servers that are not optimized. Please see the official GPUDirect documentation.
  • Persistent NIC configuration via netplan or ifupdown scripts is required for SR-IOV and shared RDMA interfaces on the host.
  • The Pod Security Policy admission controller should be enabled in order to use PSP with the Network Operator. Please see Deployment with Pod Security Policy in the Network Operator documentation for details.

Version 1.0.0

  • The Network Operator is only compatible with NVIDIA GPU Operator v1.5.2 and above.
  • Persistent NIC configuration via netplan or ifupdown scripts is required for SR-IOV and shared RDMA interfaces on the host.

System Requirements

  • RDMA-capable hardware: NVIDIA ConnectX-5 NIC or newer
  • NVIDIA GPU and driver supporting GPUDirect, e.g. Quadro RTX 6000/8000, NVIDIA T4, NVIDIA A100, NVIDIA V100 (GPUDirect only)
  • GPU Operator version 1.9.0 (required only for GPUDirect)
  • Operating system: Ubuntu 20.04 or OpenShift Container Platform 4.9
  • Container runtime: containerd

Prerequisites

  • Kubernetes: >=1.14 and <=1.22
  • Helm: v3.5+ (for information and methods of Helm installation, please refer to the official Helm website)

Versions

The following component versions are deployed by the Network Operator:

  • Node Feature Discovery: v0.8.2 (optionally deployed; may already be present in the cluster with proper configuration)
  • NVIDIA OFED driver container: 5.5-1.0.3.2
  • nv-peer-mem driver container: 1.1-0
  • k8s-rdma-shared-device-plugin: v1.2.1
  • sriov-network-device-plugin: Commit-ID a765300344368efbf43f71016e9641c58ec1241b
  • containernetworking CNI plugins: v0.8.7
  • whereabouts CNI: v0.4.2
  • multus CNI: v3.8

Network Operator Deployment on Vanilla K8s Cluster

The default installation via Helm as described below will deploy the Network Operator and related CRDs, after which an additional step is required to create a NicClusterPolicy custom resource with the configuration that is desired for the cluster. Please refer to the NicClusterPolicy CRD Section for more information on manual Custom Resource creation.

The provided Helm chart contains various parameters to facilitate the creation of a NicClusterPolicy custom resource upon deployment. For a full list of chart parameters, refer to the Network Operator Helm Chart README.

Each Operator release has a set of default version values for the various components it deploys. It is recommended not to change these values: testing and validation were performed with them, and there is no guarantee of interoperability or correctness when different versions are used.

To install the operator with chart default values, run:

# Add Repo
$ helm repo add mellanox https://mellanox.github.io/network-operator
$ helm repo update
 
# Install Operator
$ helm install -n network-operator --create-namespace --wait network-operator mellanox/network-operator
 
# View deployed resources
$ kubectl -n network-operator get pods
$ kubectl get pod -n nvidia-network-operator-resources

Chart Customization Options

To customize Network Operator Chart please refer to this document: https://github.com/Mellanox/network-operator/tree/master/deployment/network-operator#chart-parameters.

Since several parameters should be provided when creating custom resources during operator deployment, it is recommended to use a configuration file. While it is possible to override the parameters via the CLI, we recommend using a configuration file instead.

$ helm install -f ./values.yaml -n network-operator --create-namespace --wait network-operator mellanox/network-operator

By default, the Network Operator deploys Node Feature Discovery (NFD) in order to perform node labeling in the cluster, which allows proper scheduling of Network Operator resources.

If the nodes have already been labeled by other means, it is possible to disable the deployment of the NFD by setting the nfd.enabled=false chart parameter:

$ helm install --set nfd.enabled=false -n network-operator --create-namespace --wait network-operator mellanox/network-operator

Currently, the following NFD labels are used:

  • feature.node.kubernetes.io/pci-15b3.present: nodes containing NVIDIA Networking hardware
  • feature.node.kubernetes.io/pci-10de.present: nodes containing NVIDIA GPU hardware

The labels which the Network Operator depends on may change between releases.
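
To verify that the nodes have been labeled as expected, the NFD labels can be queried directly, for example:

$ kubectl get nodes -l feature.node.kubernetes.io/pci-15b3.present=true
$ kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true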

Deployment with Pod Security Policy

A Pod Security Policy is a cluster-level resource that controls security-sensitive aspects of the pod specification. PodSecurityPolicy objects define a set of conditions that a pod must run with in order to be accepted into the system, as well as defaults for the related fields.

By default, the NVIDIA Network Operator does not deploy a Pod Security Policy. To deploy one, override the psp.enabled chart parameter:

$ helm install -n network-operator --create-namespace --wait network-operator mellanox/network-operator --set psp.enabled=true

To enforce Pod Security Policies, the PodSecurityPolicy admission controller must be enabled. For instructions, refer to this article in the Kubernetes documentation.
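
How the admission plugin is enabled depends on how the cluster API server is managed; as a sketch, on kubeadm-based clusters it typically means adding PodSecurityPolicy to the kube-apiserver admission plugin list in /etc/kubernetes/manifests/kube-apiserver.yaml, for example:

- --enable-admission-plugins=NodeRestriction,PodSecurityPolicy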

The NVIDIA Network Operator deploys a privileged Pod Security Policy, which provides the operator’s pods the following permissions:

  privileged: true
  hostIPC: false
  hostNetwork: true
  hostPID: false
  allowPrivilegeEscalation: true
  readOnlyRootFilesystem: false
  allowedHostPaths: []
  allowedCapabilities:
    - '*'
  fsGroup:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:
    - configMap
    - hostPath
    - secret
    - downwardAPI

PodSecurityPolicy is deprecated as of Kubernetes v1.21 and will be removed in v1.25.

The NFD Operator on an OpenShift cluster is configured with a NodeFeatureDiscovery custom resource similar to the following (the apiVersion below is assumed to be that of the OpenShift NFD Operator):

apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  customConfig:
    configData: ""
  instance: ""
  operand:
    image: quay.io/openshift/origin-node-feature-discovery:4.8
    imagePullPolicy: Always
    namespace: openshift-nfd
    servicePort: 0
  workerConfig:
    configData: |
      sources:
        pci:
          deviceClassWhitelist:
            - "02"
            - "0200"
            - "0207"
          deviceLabelFields:
            - vendor

Network Operator Deployment on an OpenShift Container Platform

Cluster-wide Entitlement

Please follow the GPU Operator Guide to enable cluster-wide entitlement.

Network Operator Installation Using an OpenShift Container Platform Console

  1. In the OpenShift Container Platform web console side menu, select Operators > OperatorHub, and search for the NVIDIA Network Operator.
  2. Select the NVIDIA Network Operator, and click Install in the first screen and in the subsequent one.
    For additional information, see the Red Hat OpenShift Container Platform Documentation.

Network Operator Installation Using CLI

  1. Create a namespace for the Network Operator.

    Create the following Namespace custom resource (CR) that defines the network-operator namespace, and then save the YAML in the network-operator-namespace.yaml file:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: network-operator

    Create the namespace by running the following command:

    $ oc create -f network-operator-namespace.yaml
  2. Install the Network Operator in the namespace you created in the previous step by creating the below objects.
    Run the following command to get the channel value required for the next step:

    $ oc get packagemanifest network-operator -n openshift-marketplace -o jsonpath='{.status.defaultChannel}'

    Example Output

    stable
  3.  Create the following Subscription CR, and save the YAML in the network-operator-sub.yaml file:

    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: network-operator
      namespace: network-operator
    spec:
      channel: "stable"
      installPlanApproval: Manual
      name: network-operator
      sourceNamespace: openshift-marketplace
  4. Create the subscription object by running the following command:

    $ oc create -f network-operator-sub.yaml
  5. Change to the network-operator project:

    $ oc project network-operator
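
Since the Subscription above sets installPlanApproval: Manual, the resulting InstallPlan may need to be approved before the operator is actually installed. A possible approval flow (the InstallPlan name below is a placeholder):

$ oc get installplan -n network-operator
$ oc patch installplan <INSTALL_PLAN_NAME> -n network-operator --type merge --patch '{"spec":{"approved":true}}'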

Verification

To verify that the operator deployment is successful, run:

$ oc get pods

Example Output

NAME                                      READY   STATUS    RESTARTS   AGE
nvidia-network-operator-controller-manager-8f8ccf45c-zgfsq    2/2     Running   0          1m

A successful deployment shows a Running status.

Network Operator Upgrade

The network operator provides limited upgrade capabilities, which require additional manual actions if a containerized OFED driver is used. Future releases of the network operator will provide an automatic upgrade flow for the containerized driver.

Since Helm does not support auto-upgrade of existing CRDs, the user must follow a two-step process to upgrade the network-operator release:

  • Upgrade the CRD to the latest version
  • Apply helm chart update

Searching for Available Releases

To find available releases, run:

$ helm search repo mellanox/network-operator -l

Add the --devel option if you wish to list beta releases as well.

Downloading CRDs for the Specific Release

It is possible to retrieve updated CRDs from the Helm chart or from the release branch on GitHub. The example below shows how to download and unpack the Helm chart for a specified release, and then apply the CRD updates from it.

$ helm pull mellanox/network-operator --version <VERSION> --untar --untardir network-operator-chart

The --devel option is required if you wish to use the beta release.

$ kubectl apply \
  -f network-operator-chart/network-operator/crds \
  -f network-operator-chart/network-operator/charts/sriov-network-operator/crds

Preparing the Helm Values for the New Release

Download the Helm values for the specific release: 
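
For example, the default chart values for that release can be saved to a file with helm show values:

$ helm show values mellanox/network-operator --version=<VERSION> > values-<VERSION>.yaml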

Edit the values-<VERSION>.yaml file as required for your cluster. The network operator has some limitations as to which updates in the NicClusterPolicy it can handle automatically. If the configuration for the new release is different from the current configuration in the deployed release, some additional manual actions may be required.

Known limitations:

  • If component configuration was removed from the NicClusterPolicy, manual clean up of the component's resources (DaemonSets, ConfigMaps, etc.) may be required.
  • If the configuration for devicePlugin changed without image upgrade, manual restart of the devicePlugin may be required.

These limitations will be addressed in future releases.

Changes that were made directly in the NicClusterPolicy CR (e.g. with kubectl edit) will be overwritten by the Helm upgrade.

Temporarily Disabling the Network-operator

This step is required to prevent the old network-operator version from handling the updated NicClusterPolicy CR. This limitation will be removed in future network-operator releases.

$ kubectl scale deployment --replicas=0 -n network-operator network-operator

Please wait for the network-operator POD to be removed before proceeding.

The network-operator will be automatically enabled by the helm upgrade command. There is no need to enable it manually.

Applying the Helm Chart Update

To apply the helm chart update, run:

$ helm upgrade -n network-operator  network-operator mellanox/network-operator --version=<VERSION> -f values-<VERSION>.yaml

  The --devel option is required if you wish to use the beta release.

Restarting PODs with Containerized OFED Driver

This operation is required only if containerized OFED is in use.

When a containerized OFED driver is reloaded on a node, all PODs that use a secondary network based on NVIDIA Mellanox NICs will lose the network interfaces in their containers. To prevent an outage, remove all PODs that use a secondary network from the node before you reload the driver POD on it.

The helm upgrade command will only upgrade the DaemonSet spec of the OFED driver to point to the new driver version. The OFED driver's DaemonSet will not automatically restart PODs with the driver on the nodes because it uses "OnDelete" updateStrategy. The old OFED version will still run on the node until you explicitly remove the driver POD or reboot the node:

$ kubectl delete pod -l app=mofed-<OS_NAME> -n nvidia-network-operator-resources

It is possible to remove all PODs with secondary networks from all cluster nodes, and then restart OFED PODs on all nodes at once.

The alternative option is to perform an upgrade in a rolling manner to reduce the impact of the driver upgrade on the cluster. The driver POD restart can be done on each node individually. In this case, PODs with secondary networks should be removed from the single node only. There is no need to stop PODs on all nodes.

For each node follow these steps to reload the driver on the node:

  1. Remove PODs with secondary network from the node.
  2. Restart the OFED driver POD.
  3. Return the PODs with a secondary network to the node.

When the OFED driver is ready, proceed with the same steps for other nodes.

Removing PODs with a Secondary Network from the Node

To remove PODs with a secondary network from the node with node drain, run the following command:

$ kubectl drain <NODE_NAME> --pod-selector=<SELECTOR_FOR_PODS>

Replace <NODE_NAME> with -l "network.nvidia.com/operator.mofed.wait=false" if you wish to drain all nodes at once.

Restarting the OFED Driver POD

Find the OFED driver POD name for the node:

$ kubectl get pod -l app=mofed-<OS_NAME> -o wide -A

Example for Ubuntu 20.04:

kubectl get pod -l app=mofed-ubuntu20.04 -o wide -A

Deleting the OFED Driver POD from the Node

To delete the OFED driver POD from the node, run:

$ kubectl delete pod -n <DRIVER_NAMESPACE> <OFED_POD_NAME>

Replace <OFED_POD_NAME> with -l app=mofed-ubuntu20.04 if you wish to remove OFED PODs on all nodes at once.

A new version of the OFED POD will automatically start.
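
Before returning workloads to the node, it may be useful to wait until the new driver POD is ready, for example:

$ kubectl wait --for=condition=Ready pod -l app=mofed-ubuntu20.04 -n nvidia-network-operator-resources --timeout=600s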

Returning PODs with a Secondary Network to the Node

After the OFED POD is ready on the node, you can make the node schedulable again.

The command below will uncordon (remove node.kubernetes.io/unschedulable:NoSchedule taint) the node and return the PODs to it:

$ kubectl uncordon -l "network.nvidia.com/operator.mofed.wait=false"

Deployment Examples 

Since several parameters should be provided when creating custom resources during operator deployment, it is recommended to use a configuration file. While it is possible to override the parameters via the CLI, this is cumbersome and therefore not recommended.

Below are deployment examples in which the values.yaml file is provided to Helm during the installation of the Network Operator, by running:

$ helm install -f ./values.yaml -n network-operator --create-namespace --wait network-operator mellanox/network-operator

Network Operator Deployment with the RDMA Shared Device Plugin

Network Operator deployment with the default version of the OFED driver and a single RDMA resource mapped to the enp1 netdev.

values.yaml configuration file for such a deployment:

nfd:
  enabled: true
sriovNetworkOperator:
  enabled: false
# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: true
   
nvPeerDriver:
  deploy: false
   
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      devices: [enp1]
 
sriovDevicePlugin:
  deploy: false
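
A workload can then consume the shared RDMA resource by requesting it in the pod spec. The sketch below assumes the RDMA shared device plugin's default rdma/ resource prefix and uses a placeholder image:

apiVersion: v1
kind: Pod
metadata:
  name: rdma-shared-test-pod
spec:
  restartPolicy: OnFailure
  containers:
  - image: <rdma image>
    name: rdma-shared-test-ctr
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      requests:
        rdma/rdma_shared_device_a: 1
      limits:
        rdma/rdma_shared_device_a: 1
    command:
    - sh
    - -c
    - sleep inf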

Network Operator Deployment with Multiple Resources in RDMA Shared Device Plugin

Network Operator deployment with the default version of OFED and an RDMA device plugin with two RDMA resources. The first is mapped to enp1 and enp2, and the second is mapped to enp3 and enp4.

values.yaml configuration file for such a deployment:

nfd:
  enabled: true
sriovNetworkOperator:
  enabled: false
# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: true
   
nvPeerDriver:
  deploy: false
   
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      devices: [enp1, enp2]
    - name: rdma_shared_device_b
      devices: [enp3, enp4]
 
sriovDevicePlugin:
  deploy: false
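
Once the device plugin is running, the advertised RDMA resources can be checked in the node's allocatable resources (assuming the plugin's default rdma/ resource prefix):

$ kubectl get node <NODE_NAME> -o jsonpath='{.status.allocatable}'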

Network Operator Deployment with a Secondary Network 

Network Operator deployment with:

  • RDMA shared device plugin
  • Secondary network
  • Multus CNI
  • Containernetworking-plugins CNI plugins
  • Whereabouts IPAM CNI Plugin

values.yaml:

nfd:
  enabled: true
sriovNetworkOperator:
  enabled: false
# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: false

rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      devices: [enp1]
secondaryNetwork:
  deploy: true
  multus:
    deploy: true
  cniPlugins:
    deploy: true
  ipamPlugin:
    deploy: true
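
With the secondary network components deployed, a network must still be defined before pods can attach to it. Below is a minimal sketch of a MacvlanNetwork custom resource for this setup; the master interface, IP range and network name are illustrative, and the full schema is referenced in the MacVlanNetwork CRD section below:

apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: example-macvlan-network
spec:
  networkNamespace: "default"
  master: "enp1"
  mode: "bridge"
  mtu: 1500
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.2.225/28",
      "log_file": "/var/log/whereabouts.log",
      "log_level": "info"
    }

Pods can then attach to this network through the k8s.v1.cni.cncf.io/networks annotation, as in the other examples in this document.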

Network Operator Deployment with a Host Device Network 

Network operator deployment with:

  • SR-IOV device plugin, single SR-IOV resource pool
  • Secondary network
  • Multus CNI
  • Containernetworking-plugins CNI plugins
  • Whereabouts IPAM CNI plugin

In this mode, the Network Operator can also be deployed in virtualized environments. It supports both Ethernet and InfiniBand modes. From the Network Operator perspective, there is no difference between the deployment procedures. To work in a VM (virtual machine), PCI passthrough must be configured for the SR-IOV devices. The Network Operator works with both VFs (Virtual Functions) and PFs (Physical Functions) inside the VMs.

values.yaml:

nfd:
  enabled: true
sriovNetworkOperator:
  enabled: false
# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: false

rdmaSharedDevicePlugin:
  deploy: false

sriovDevicePlugin:
  deploy: true
  resources:
    - name: hostdev
      vendors: [15b3]
secondaryNetwork:
  deploy: true
  multus:
    deploy: true
    image: multus
  cniPlugins:
    deploy: true
  ipamPlugin:
    deploy: true

After deployment, the Network Operator should be configured, and K8s networking should be deployed in order to use it in the pod configuration.
host-device-net.yaml configuration file for such a deployment:

apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: hostdev-net
spec:
  networkNamespace: "default"
  resourceName: "nvidia.com/hostdev"
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.3.225/28",
      "exclude": [
       "192.168.3.229/30",
       "192.168.3.236/32"
      ],
      "log_file" : "/var/log/whereabouts.log",
      "log_level" : "info"
    }

pod.yaml configuration file for such a deployment:

apiVersion: v1
kind: Pod
metadata:
  name: hostdev-test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: hostdev-net
spec:
  restartPolicy: OnFailure
  containers:
  - image: <rdma image>
    name: mofed-test-ctr
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      requests:
        nvidia.com/hostdev: 1
      limits:
        nvidia.com/hostdev: 1
    command:
    - sh
    - -c
    - sleep inf
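
Once the pod is running, the allocated device can be inspected from inside the container; assuming the chosen RDMA image provides the standard user-space tools, for example:

$ kubectl exec -it hostdev-test-pod -- ibv_devices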

Network Operator Deployment for GPUDirect Workloads

GPUDirect requires the following:

  • MOFED 5.5-1.0.3.2 or newer
  • GPU Operator 1.9.0 or newer
  • NVIDIA GPU and driver supporting GPUDirect, e.g. Quadro RTX 6000/8000, NVIDIA T4, NVIDIA V100, NVIDIA A100

values.yaml example: 

nfd:
  enabled: true
sriovNetworkOperator:
  enabled: false
# NicClusterPolicy CR values:
ofedDriver:
  deploy: true
deployCR: true

sriovDevicePlugin:
  deploy: true
  resources:
    - name: hostdev
      vendors: [15b3]

secondaryNetwork:
  deploy: true
  multus:
    deploy: true
  cniPlugins:
    deploy: true
  ipamPlugin:
    deploy: true

host-device-net.yaml:

apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: sriov-net
spec:
  networkNamespace: "default"
  resourceName: "hostdev"
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.3.225/28",
      "exclude": [
       "192.168.3.229/30",
       "192.168.3.236/32"
      ],
      "log_file" : "/var/log/whereabouts.log",
      "log_level" : "info"
    }

host-net-gpudirect-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: testpod1
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-net
spec:
  containers:
  - name: appcntr1
    image: <image>
    imagePullPolicy: IfNotPresent
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    command:
      - sh
      - -c
      - sleep inf
    resources:
      requests:
        nvidia.com/hostdev: '1'
        nvidia.com/gpu: '1'
      limits:
        nvidia.com/hostdev: '1'
        nvidia.com/gpu: '1'
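
As a quick sanity check of GPU/NIC placement from inside the pod (assuming the GPU driver utilities are available in the container), the topology can be inspected with:

$ kubectl exec -it testpod1 -- nvidia-smi topo -m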

Network Operator Deployment in SR-IOV Legacy Mode

The SR-IOV Network Operator will be deployed with the default configuration. You can override these settings using a CLI argument, or the ‘sriov-network-operator’ section in the values.yaml file. For more information, refer to the Project Documentation.

This deployment mode supports DPDK applications. In order to run DPDK applications, hugepages must be configured on the required K8s worker nodes. By default, the inbox operating system driver is used. For cases with specific requirements, the OFED container should be deployed.
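
One common way to provide hugepages, shown here only as an illustration since the exact method depends on the distribution, is via kernel boot parameters on the worker nodes; kubelet then advertises the corresponding hugepages resource:

# Example kernel command line parameters (e.g. appended to GRUB_CMDLINE_LINUX):
default_hugepagesz=1G hugepagesz=1G hugepages=16
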
values.yaml configuration file for such a deployment: 

nfd:
  enabled: true
sriovNetworkOperator:
  enabled: true
 
# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: true
rdmaSharedDevicePlugin:
  deploy: false
sriovDevicePlugin:
  deploy: false

secondaryNetwork:
  deploy: true
  multus:
    deploy: true
  cniPlugins:
    deploy: true
  ipamPlugin:
    deploy: true

Following the deployment, the Network Operator should be configured, and sriovnetwork node policy and K8s networking should be deployed. 
sriovnetwork-node-policy.yaml configuration file for such a deployment:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-1
  namespace: network-operator
spec:
  deviceType: netdevice
  mtu: 1500
  nicSelector:
    vendor: "15b3"
    pfNames: ["ens2f0"]
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 8
  priority: 90
  isRdma: true
  resourceName: rdma_network

sriovnetwork.yaml configuration file for such a deployment:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: "rdma-network"
  namespace: network-operator
spec:
  vlan: 0
  networkNamespace: "default"
  resourceName: "rdma_network"
  ipam: |-
    {
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "192.168.101.0/24"
    }

The ens2f0 network interface name has been chosen from the output of the following command:

$ kubectl -n network-operator get sriovnetworknodestates.sriovnetwork.openshift.io

...
 
status:
  interfaces:
  - deviceID: 101d
    driver: mlx5_core
    linkSpeed: 100000 Mb/s
    linkType: ETH
    mac: 0c:42:a1:2b:74:ae
    mtu: 1500
    name: ens2f0
    pciAddress: "0000:07:00.0"
    totalvfs: 8
    vendor: 15b3
  - deviceID: 101d
    driver: mlx5_core
    linkType: ETH
    mac: 0c:42:a1:2b:74:af
    mtu: 1500
    name: ens2f1
    pciAddress: "0000:07:00.1"
    totalvfs: 8
    vendor: 15b3
 
...

Wait for all required pods to be spawned:

# kubectl get pod -n network-operator | grep sriov
network-operator-sriov-network-operator-544c8dbbb9-vzkmc          1/1     Running   0          5d
sriov-cni-qgblf                                                   2/2     Running   0          2d6h
sriov-device-plugin-vwpzn                                         1/1     Running   0          2d6h
sriov-network-config-daemon-qv467                                 1/1     Running   0          5d
 
# kubectl get pod -n nvidia-network-operator-resources
NAME                                            READY   STATUS    RESTARTS   AGE
cni-plugins-ds-kbvnm                            1/1     Running   0          5d
cni-plugins-ds-pcllg                            1/1     Running   0          5d
kube-multus-ds-5j6ns                            1/1     Running   0          5d
kube-multus-ds-mxgvl                            1/1     Running   0          5d
mofed-ubuntu20.04-ds-2zzf4                      1/1     Running   0          5d
mofed-ubuntu20.04-ds-rfnsw                      1/1     Running   0          5d
whereabouts-nw7hn                               1/1     Running   0          5d
whereabouts-zvhrv                               1/1     Running   0          5d

pod.yaml configuration file for such a deployment:

apiVersion: v1
kind: Pod
metadata:
  name: testpod1
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-network
spec:
  containers:
  - name: appcntr1
    image: <image>
    imagePullPolicy: IfNotPresent
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    volumeMounts:
      - mountPath: /dev/hugepages
        name: hugepage
    resources:
      requests:
        cpu: 2
        memory: 4Gi
        hugepages-1Gi: 2Gi
        nvidia.com/rdma_network: '1'
      limits:
        hugepages-1Gi: 2Gi
        nvidia.com/rdma_network: '1'
    command:
    - sh
    - -c
    - sleep inf
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
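
After the pod starts, the secondary interface added by Multus (net1 by default) can be checked from inside the container, assuming the chosen image provides iproute2:

$ kubectl exec -it testpod1 -- ip -d link show net1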

Network Operator Deployment with InfiniBand Network

Network operator deployment with InfiniBand network requires the following:

  • MOFED and OpenSM running. OpenSM runs on top of the MOFED stack, so both the driver and the subnet manager should come from the same installation. Note that partitions configured by OpenSM should specify defmember=full to enable the SR-IOV functionality over InfiniBand. For details, please refer to this documentation article.
  • InfiniBand device: both the host device and the switch ports must be enabled in InfiniBand mode.

values.yaml:

nfd:
  enabled: true
sriovNetworkOperator:
  enabled: true
 
# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: true
rdmaSharedDevicePlugin:
  deploy: false
sriovDevicePlugin:
  deploy: false

secondaryNetwork:
  deploy: true
  multus:
    deploy: true
  cniPlugins:
    deploy: true
  ipamPlugin:
    deploy: true

sriov-ib-network-node-policy.yaml:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: infiniband-sriov
  namespace: network-operator
spec:
  deviceType: netdevice
  mtu: 1500
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  nicSelector:
    vendor: "15b3"
  linkType: ib
  isRdma: true
  numVfs: 8
  priority: 90
  resourceName: mlnxnics

sriov-ib-network.yaml:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovIBNetwork
metadata:
  name: example-sriov-ib-network
  namespace: network-operator
spec:
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.5.225/28",
      "exclude": [
       "192.168.5.229/30",
       "192.168.5.236/32"
      ],
      "log_file": "/var/log/whereabouts.log",
      "log_level": "info"
    }
  resourceName: mlnxnics
  linkState: enable
  networkNamespace: default

sriov-ib-network-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: test-sriov-ib-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: example-sriov-ib-network 
spec:
  containers:
    - name: test-sriov-ib-pod
      image: centos/tools
      imagePullPolicy: IfNotPresent
      command:
        - sh
        - -c
        - sleep inf
      securityContext:
        capabilities:
          add: [ "IPC_LOCK" ]
      resources:
        requests:
          nvidia.com/mlnxnics: "1"
        limits:
          nvidia.com/mlnxnics: "1"
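
Once the pod is running, the InfiniBand interface assigned from the VF pool can be verified from inside the container (assuming the tools image provides iproute2):

$ kubectl exec -it test-sriov-ib-pod -- ip addr show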

Network Operator Deployment for DPDK Workloads with NicClusterPolicy

Network Operator deployment with:

  • Host Device Network, DPDK pod

nicclusterpolicy.yaml:

apiVersion: mellanox.com/v1alpha1 
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: mofed
    repository: mellanox
    version: 5.5-1.0.3.2
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: ghcr.io/k8snetworkplumbingwg
    version: a765300344368efbf43f71016e9641c58ec1241b
    config: |
      {
        "resourceList": [
            {
                "resourcePrefix": "nvidia.com",
                "resourceName": "rdma_host_dev",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["1018"],
                    "drivers": ["mlx5_core"]
                }
            }
        ]
      }
  psp:
    enabled: false
  secondaryNetwork:
    cniPlugins:
      image: plugins
      repository: ghcr.io/k8snetworkplumbingwg
      version: v0.8.7-amd64
    ipamPlugin:
      image: whereabouts
      repository: ghcr.io/k8snetworkplumbingwg
      version: v0.4.2-amd64
    multus:
      image: multus-cni
      repository: ghcr.io/k8snetworkplumbingwg
      version: v3.8

host-device-net.yaml:

apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: sriov-net
spec:
  networkNamespace: "default"
  resourceName: "rdma_host_dev"
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.3.225/28",
      "exclude": [
       "192.168.3.229/30",
       "192.168.3.236/32"
      ],
      "log_file" : "/var/log/whereabouts.log",
      "log_level" : "info"
    }

 pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: testpod1
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-net
spec:
  containers:
  - name: appcntr1
    image: <dpdk image>
    imagePullPolicy: IfNotPresent
    securityContext:
      capabilities:
          add: ["IPC_LOCK"]
    volumeMounts:
      - mountPath: /dev/hugepages
        name: hugepage
    resources:
      requests:
        memory: 1Gi
        hugepages-1Gi: 2Gi
        nvidia.com/rdma_host_dev: '1'
      limits:
        hugepages-1Gi: 2Gi
        nvidia.com/rdma_host_dev: '1'
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
  volumes:
   - name: hugepage
     emptyDir:
       medium: HugePages

NicClusterPolicy CRD

For more information on NicClusterPolicy custom resource, please refer to the Network-Operator Project Documentation.

MacVlanNetwork CRD

For more information on MacVlanNetwork custom resource, please refer to the Network-Operator Project Documentation.

Ensuring Deployment Readiness 

Once the Network Operator is deployed and a NicClusterPolicy resource is created, the operator will reconcile the state of the cluster until it reaches the desired state, as defined in the resource.

Alignment of the cluster with the defined policy can be verified in the custom resource status.

A "Ready" state indicates that the required components have been deployed and that the policy is applied on the cluster.

Example Status Field of a NICClusterPolicy Instance

Status:
  Applied States:
    Name:   state-OFED
    State:  ready
    Name:   state-RDMA-device-plugin
    State:  ready
    Name:   state-NV-Peer
    State:  ignore
    Name:   state-cni-plugins
    State:  ignore
    Name:   state-Multus
    State:  ready
    Name:   state-whereabouts
    State:  ready
  State:    ready

An "Ignore" state indicates that the sub-state was not defined in the custom resource, and thus, it is ignored.