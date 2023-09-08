Warning Please note that this step is optional. However, if you wish to configure uplink representor mode, make sure this step is performed before configuring SwitchDev.

The following uplink representor modes are available for configuration:

new_netdev: default mode. In this mode, the uplink representor is created as a new netdevice.

nic_netdev: in this mode, the NIC netdevice acts as the uplink representor device.

Example:

echo nic_netdev > /sys/class/net/ens1f0/compat/devlink/uplink_rep_mode

Notes:

The mode can only be changed while the e-switch is in Legacy mode.

The mode is not saved when reloading mlx5_core.

When two PFs in the same bonding device need to enter SwitchDev mode, the uplink representor mode of both PFs must be the same (either nic_netdev or new_netdev).
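The bonding note above implies that both PFs receive the same mode before either one enters SwitchDev. A minimal sketch of that step follows. The function name set_uplink_rep_mode and the SYSFS_NET override (only there so the helper can be pointed at a scratch directory when trying it out) are illustrative assumptions, not part of the driver; the sysfs path is the one used in the example above.

```shell
# Write the same uplink representor mode to every PF named on the command line.
# SYSFS_NET is a hypothetical override for dry runs; it defaults to the real sysfs tree.
set_uplink_rep_mode() {
    urm_mode=$1
    shift
    case "$urm_mode" in
        new_netdev|nic_netdev) ;;
        *) echo "unknown uplink representor mode: $urm_mode" >&2; return 1 ;;
    esac
    for urm_pf in "$@"; do
        echo "$urm_mode" > "${SYSFS_NET:-/sys/class/net}/$urm_pf/compat/devlink/uplink_rep_mode" || return 1
    done
}

# Example call (the PF names are placeholders for the two bonded PFs):
#   set_uplink_rep_mode nic_netdev ens1f0 ens1f1
```

Because the mode is not saved across a reload of mlx5_core, a helper like this would have to be run again after every driver reload, while the PFs are still in Legacy mode.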

Unbind the VFs.

echo 0000:04:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:04:00.3 > /sys/bus/pci/drivers/mlx5_core/unbind

Warning VMs with attached VFs must be powered off to be able to unbind the VFs.

Change the e-switch mode from legacy to switchdev on the PF device. This will also create the VF representor netdevices in the host OS.

# echo switchdev > /sys/class/net/enp4s0f0/compat/devlink/mode

Warning Before changing the mode, make sure that all VFs are unbound.

Warning To go back to SR-IOV legacy mode:

# echo legacy > /sys/class/net/enp4s0f0/compat/devlink/mode

Running this command will also remove the VF representor netdevices.

Set the network VF representor device names to be in the form of $PF_$VFID, where $PF is the PF netdev name and $VFID is the VF ID=0,1,[..], either by:

* Using this rule in /etc/udev/rules.d/82-net-setup-link.rules:

SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="e41d2d60971d", \
ATTR{phys_port_name}!="", NAME="enp4s0f1_$attr{phys_port_name}"

Replace the phys_switch_id value ("e41d2d60971d" above) with the value matching your switch, as obtained from:

ip -d link show enp4s0f1

Example output of device names when using the udev rule:

ls -l /sys/class/net/ens4*
lrwxrwxrwx 1 root root 0 Mar 27 17:14 enp4s0f0 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.0/net/enp4s0f0
lrwxrwxrwx 1 root root 0 Mar 27 17:15 enp4s0f0_0 -> ../../devices/virtual/net/enp4s0f0_0
lrwxrwxrwx 1 root root 0 Mar 27 17:15 enp4s0f0_1 -> ../../devices/virtual/net/enp4s0f0_1

* Using the supplied 82-net-setup-link.rules and vf-net-link-name.sh script to set the VF representor device names.

From the scripts directory copy vf-net-link-name.sh to /etc/udev/ and 82-net-setup-link.rules to /etc/udev/rules.d/ .

Make sure vf-net-link-name.sh is executable.

Run the openvswitch service.

# systemctl start openvswitch

Create an OVS bridge (here it's named ovs-sriov).

# ovs-vsctl add-br ovs-sriov

Enable hardware offload (disabled by default).

# ovs-vsctl set Open_vSwitch . other_config:hw-offload=true

Restart the openvswitch service. This step is required for HW offload changes to take effect.

# systemctl restart openvswitch

Warning HW offload policy can also be changed by setting the tc-policy using one of the following values:

* none - adds a TC rule to both the software and the hardware (default)
* skip_sw - adds a TC rule only to the hardware
* skip_hw - adds a TC rule only to the software

The above change is used for debug purposes.

Add the PF and the VF representor netdevices as OVS ports.

# ovs-vsctl add-port ovs-sriov enp4s0f0
# ovs-vsctl add-port ovs-sriov enp4s0f0_0
# ovs-vsctl add-port ovs-sriov enp4s0f0_1

Make sure to bring up the PF and representor netdevices.

# ip link set dev enp4s0f0 up
# ip link set dev enp4s0f0_0 up
# ip link set dev enp4s0f0_1 up

The PF represents the uplink (wire).

# ovs-dpctl show
system@ovs-system:
  lookups: hit:0 missed:192 lost:1
  flows: 2
  masks: hit:384 total:2 hit/pkt:2.00
  port 0: ovs-system (internal)
  port 1: ovs-sriov (internal)
  port 2: enp4s0f0
  port 3: enp4s0f0_0
  port 4: enp4s0f0_1

Run traffic from the VFs and observe the rules added to the OVS data-path.

# ovs-dpctl dump-flows
recirc_id(0),in_port(3),eth(src=e4:11:22:33:44:50,dst=e4:1d:2d:a5:f3:9d),eth_type(0x0800),ipv4(frag=no), packets:33, bytes:3234, used:1.196s, actions:2
recirc_id(0),in_port(2),eth(src=e4:1d:2d:a5:f3:9d,dst=e4:11:22:33:44:50),eth_type(0x0800),ipv4(frag=no), packets:34, bytes:3332, used:1.196s, actions:3

In the example above, the ping was initiated from VF0 (OVS port 3) to the outer node (OVS port 2), where the VF MAC is e4:11:22:33:44:50 and the outer node MAC is e4:1d:2d:a5:f3:9d.

As shown above, two OVS rules were added, one in each direction.
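When many flows are installed, the per-direction rules are easier to compare once each line is reduced to its input port, counters, and actions. The filter below is a rough sketch of that idea; the function name summarize_flows is an assumption, and the pattern only targets the line layout seen in this document's ovs-dpctl dump-flows output (other OVS versions may print additional fields).

```shell
# Reduce each datapath flow line read from stdin to in_port, counters and actions.
# Lines that do not match the expected layout are silently skipped.
summarize_flows() {
    sed -n 's/.*in_port(\([0-9]*\)).* packets:\([0-9]*\), bytes:\([0-9]*\),.*actions:\(.*\)$/in_port=\1 packets=\2 bytes=\3 actions=\4/p'
}

# Sample input taken from the ping example in this section.
# On a live system the input would come from: ovs-dpctl dump-flows | summarize_flows
summarize_flows <<'EOF'
recirc_id(0),in_port(3),eth(src=e4:11:22:33:44:50,dst=e4:1d:2d:a5:f3:9d),eth_type(0x0800),ipv4(frag=no), packets:33, bytes:3234, used:1.196s, actions:2
recirc_id(0),in_port(2),eth(src=e4:1d:2d:a5:f3:9d,dst=e4:11:22:33:44:50),eth_type(0x0800),ipv4(frag=no), packets:34, bytes:3332, used:1.196s, actions:3
EOF
```

For the two sample lines this prints `in_port=3 packets=33 bytes=3234 actions=2` and `in_port=2 packets=34 bytes=3332 actions=3`, which makes the "one rule in each direction" pattern visible at a glance.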

Note that you can also verify offloaded packets by adding type=offloaded to the command. For example:

# ovs-dpctl dump-flows type=offloaded

The aging timeout of OVS is given in ms and can be controlled with this command:

# ovs-vsctl set Open_vSwitch . other_config:max-idle=30000





It is common to require the VM traffic to be tagged by the OVS, such that the OVS adds tags (vlan push) to the packets sent by the VMs and strips them (vlan pop) from the packets received for this VM from other nodes/VMs.

To do so, add a tag=$TAG section to the OVS command line that adds the representor ports. For example, here we use VLAN ID 52.

# ovs-vsctl add-port ovs-sriov enp4s0f0
# ovs-vsctl add-port ovs-sriov enp4s0f0_0 tag=52
# ovs-vsctl add-port ovs-sriov enp4s0f0_1 tag=52

The PF port should not have a VLAN attached. This will cause OVS to add VLAN push/pop actions when managing traffic for these VFs.
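With more than a couple of VFs, the tagged add-port commands follow a simple pattern: the PF is added untagged, and each representor named $PF_$VFID is added with the tag. The sketch below only prints the commands so they can be reviewed before being run; the function name print_tagged_ports and its argument order are assumptions made for illustration.

```shell
# Print (do not execute) the add-port commands for a PF and its first N VF representors.
# Representor names follow the $PF_$VFID convention described earlier in this document.
print_tagged_ports() {
    tp_bridge=$1 tp_pf=$2 tp_tag=$3 tp_nvfs=$4
    # The uplink (PF) port is added without a VLAN tag.
    echo "ovs-vsctl add-port $tp_bridge $tp_pf"
    tp_i=0
    while [ "$tp_i" -lt "$tp_nvfs" ]; do
        echo "ovs-vsctl add-port $tp_bridge ${tp_pf}_$tp_i tag=$tp_tag"
        tp_i=$((tp_i + 1))
    done
}

# Reproduces the three commands of the VLAN ID 52 example.
print_tagged_ports ovs-sriov enp4s0f0 52 2
```

Piping the output to a shell (for example `print_tagged_ports ovs-sriov enp4s0f0 52 2 | sh`) would apply the commands on a host where the bridge and representors already exist.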

To see how the OVS rules look with vlans, here we initiated a ping from VF0 (OVS port 3) to an outer node (OVS port 2), where the VF MAC is e4:11:22:33:44:50 and the outer node MAC is 00:02:c9:e9:bb:b2 .

At this stage, we can see that two OVS rules were added, one in each direction.

recirc_id(0),in_port(3),eth(src=e4:11:22:33:44:50,dst=00:02:c9:e9:bb:b2),eth_type(0x0800),ipv4(frag=no), \
packets:0, bytes:0, used:never, actions:push_vlan(vid=52,pcp=0),2
recirc_id(0),in_port(2),eth(src=00:02:c9:e9:bb:b2,dst=e4:11:22:33:44:50),eth_type(0x8100), \
vlan(vid=52,pcp=0),encap(eth_type(0x0800),ipv4(frag=no)), packets:0, bytes:0, used:never, actions:pop_vlan,3

For outgoing traffic (in port = 3), the actions are push vlan (52) and forward to port 2

For incoming traffic (in port = 2), matching is done also on vlan, and the actions are pop vlan and forward to port 3

Warning VXLAN encapsulation / decapsulation offloading of OVS actions is supported only in ConnectX-5 adapter cards.

In case of offloading VXLAN, the PF should not be added as a port in the OVS data-path but rather be assigned with the IP address to be used for encapsulation.

The example below shows two hosts (PFs) with IPs 1.1.1.177 and 1.1.1.75, where the PF device on both hosts is enp4s0f1 and the VXLAN tunnel is set with VNID 98:

On the first host:

# ip addr add 1.1.1.177/24 dev enp4s0f1
# ovs-vsctl add-port ovs-sriov vxlan0 -- set interface vxlan0 type=vxlan options:local_ip=1.1.1.177 options:remote_ip=1.1.1.75 options:key=98

On the second host:

# ip addr add 1.1.1.75/24 dev enp4s0f1
# ovs-vsctl add-port ovs-sriov vxlan0 -- set interface vxlan0 type=vxlan options:local_ip=1.1.1.75 options:remote_ip=1.1.1.177 options:key=98

When encapsulating guest traffic, the VF’s device MTU must be reduced to allow the host/HW to add the encap headers without fragmenting the resulting packet. As such, the VF’s MTU must be lowered to 1450 for IPv4 and 1430 for IPv6.
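The 1450 and 1430 figures follow from subtracting the VXLAN encapsulation overhead from the underlay MTU. The sketch below assumes a 1500-byte underlay MTU and the standard header sizes (inner Ethernet 14, VXLAN 8, UDP 8, outer IPv4 20 or outer IPv6 40 bytes); neither assumption is stated in the text above, and the function name vxlan_vf_mtu is illustrative.

```shell
# Compute the largest VF MTU that still fits in the underlay after VXLAN encapsulation.
# Usage: vxlan_vf_mtu <underlay MTU> <outer IP version: 4 or 6>
vxlan_vf_mtu() {
    vm_underlay=$1
    case "$2" in
        4) vm_outer_ip=20 ;;   # outer IPv4 header
        6) vm_outer_ip=40 ;;   # outer IPv6 header
        *) echo "outer IP version must be 4 or 6" >&2; return 1 ;;
    esac
    # Overhead: inner Ethernet (14) + VXLAN (8) + UDP (8) + outer IP header.
    echo $(( vm_underlay - 14 - 8 - 8 - vm_outer_ip ))
}

vxlan_vf_mtu 1500 4   # prints 1450
vxlan_vf_mtu 1500 6   # prints 1430
```

On an underlay configured for jumbo frames the same arithmetic applies, so the VF MTU can be raised accordingly.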

To see how the OVS rules look with vxlan encap/decap actions, here we initiated a ping from a VM on the 1st host whose MAC is e4:11:22:33:44:50 to a VM on the 2nd host whose MAC is 46:ac:d1:f1:4c:af

At this stage we see that two OVS rules were added to the first host; one in each direction.

# ovs-dpctl show
system@ovs-system:
  lookups: hit:7869 missed:241 lost:2
  flows: 2
  masks: hit:13726 total:10 hit/pkt:1.69
  port 0: ovs-system (internal)
  port 1: ovs-sriov (internal)
  port 2: vxlan_sys_4789 (vxlan)
  port 3: enp4s0f1_0
  port 4: enp4s0f1_1

# ovs-dpctl dump-flows
recirc_id(0),in_port(3),eth(src=e4:11:22:33:44:50,dst=46:ac:d1:f1:4c:af),eth_type(0x0800),ipv4(tos=0/0x3,frag=no), packets:4, bytes:392, used:0.664s, actions:set(tunnel(tun_id=0x62,dst=1.1.1.75,ttl=64,flags(df,key))),2
recirc_id(0),tunnel(tun_id=0x62,src=1.1.1.75,dst=1.1.1.177,ttl=64,flags(-df-csum+key)),in_port(2),skb_mark(0),eth(src=46:ac:d1:f1:4c:af,dst=e4:11:22:33:44:50),eth_type(0x0800),ipv4(frag=no), packets:5, bytes:490, used:0.664s, actions:3

For outgoing traffic (in port = 3), the actions are set vxlan tunnel to host 1.1.1.75 (encap) and forward to port 2

For incoming traffic (in port = 2), matching is done also on vxlan tunnel info which was decapsulated, and the action is forward to port 3
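The datapath prints the tunnel key in hexadecimal, so the VNID 98 configured with options:key=98 shows up as tun_id=0x62 in the flows. A small sketch of the conversion in both directions, using only the shell's printf (the function names are illustrative):

```shell
# Convert a decimal VNID to the hexadecimal tun_id form shown by ovs-dpctl.
vni_to_tun_id() {
    printf '0x%x\n' "$1"
}

# Convert a hexadecimal tun_id back to the decimal VNID used in options:key.
tun_id_to_vni() {
    printf '%d\n' "$1"
}

vni_to_tun_id 98     # prints 0x62
tun_id_to_vni 0x62   # prints 98
```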

Offloading rules can also be added directly, and not just through OVS, using the tc utility.

To enable TC ingress on both the PF and the VF:

# tc qdisc add dev enp4s0f0 ingress
# tc qdisc add dev enp4s0f0_0 ingress
# tc qdisc add dev enp4s0f0_1 ingress

# tc filter add dev ens4f0_0 protocol ip parent ffff: \
    flower \
    skip_sw \
    dst_mac e4:11:22:11:4a:51 \
    src_mac e4:11:22:11:4a:50 \
    action drop





# tc filter add dev ens4f0_0 protocol 802.1Q parent ffff: \
    flower \
    skip_sw \
    dst_mac e4:11:22:11:4a:51 \
    src_mac e4:11:22:11:4a:50 \
    action vlan push id 100 \
    action mirred egress redirect dev ens4f0

# tc filter add dev ens4f0 protocol 802.1Q parent ffff: \
    flower \
    skip_sw \
    dst_mac e4:11:22:11:4a:51 \
    src_mac e4:11:22:11:4a:50 \
    vlan_ethtype 0x800 \
    vlan_id 100 \
    vlan_prio 0 \
    action vlan pop \
    action mirred egress redirect dev ens4f0_0





# tc filter add dev ens4f0_0 protocol 0x806 parent ffff: \
    flower \
    skip_sw \
    dst_mac e4:11:22:11:4a:51 \
    src_mac e4:11:22:11:4a:50 \
    action tunnel_key set \
    src_ip 20.1.12.1 \
    dst_ip 20.1.11.1 \
    id 100 \
    action mirred egress redirect dev vxlan100

# tc filter add dev vxlan100 protocol 0x806 parent ffff: \
    flower \
    skip_sw \
    dst_mac e4:11:22:11:4a:51 \
    src_mac e4:11:22:11:4a:50 \
    enc_src_ip 20.1.11.1 \
    enc_dst_ip 20.1.12.1 \
    enc_key_id 100 \
    enc_dst_port 4789 \
    action tunnel_key unset \
    action mirred egress redirect dev ens4f0_0

Bond rules can be added in one of the following methods:

Using shared block (requires kernel support):

# tc qdisc add dev bond0 ingress_block 22 ingress
# tc qdisc add dev ens4p0 ingress_block 22 ingress
# tc qdisc add dev ens4p1 ingress_block 22 ingress

Add drop rule:

# tc filter add block 22 protocol arp parent ffff: prio 3 \
    flower \
    dst_mac e4:11:22:11:4a:51 \
    action drop

Add redirect rule from bond to representor:

# tc filter add block 22 protocol arp parent ffff: prio 3 \
    flower \
    dst_mac e4:11:22:11:4a:50 \
    action mirred egress redirect dev ens4f0_0

Add redirect rule from representor to bond:

# tc filter add dev ens4f0_0 protocol arp parent ffff: prio 3 \
    flower \
    dst_mac ec:0d:9a:8a:28:42 \
    action mirred egress redirect dev bond0

Without using shared block:

Add redirect rule from bond to representor:

# tc filter add dev bond0 protocol arp parent ffff: prio 1 \
    flower \
    dst_mac e4:11:22:11:4a:50 \
    action mirred egress redirect dev ens4f0_0

Add redirect rule from representor to bond:

# tc filter add dev ens4f0_0 protocol arp parent ffff: prio 3 \
    flower \
    dst_mac ec:0d:9a:8a:28:42 \
    action mirred egress redirect dev bond0



VLAN Modify rules can be added in one of the following methods:

tc filter add dev $REP_DEV1 protocol 802.1q ingress prio 1 flower \
    vlan_id 10 \
    action vlan modify id 11 pipe \
    action mirred egress redirect dev $REP_DEV2

tc filter add dev $REP_DEV1 protocol 802.1q ingress prio 1 flower \
    vlan_id 10 \
    action vlan pop pipe action vlan push id 11 pipe \
    action mirred egress redirect dev $REP_DEV2

SR-IOV VF LAG allows the NIC’s physical functions (PFs) to get the rules that the OVS will try to offload to the bond net-device, and to offload them to the hardware e-switch. Bond modes supported are:

Active-Backup

XOR

LACP

SR-IOV VF LAG enables complete offload of the LAG functionality to the hardware. The bonding creates a single bonded PF port. Packets from up-link can arrive from any of the physical ports, and will be forwarded to the bond device.

When hardware offload is used, packets from both ports can be forwarded to any of the VFs. Traffic from the VF can be forwarded to both ports according to the bonding state. Meaning, when in active-backup mode, only one PF is up, and traffic from any VF will go through this PF. When in XOR or LACP mode, if both PFs are up, traffic from any VF will split between these two PFs.

To enable SR-IOV VF LAG, both physical functions of the NIC should first be configured to SR-IOV SwitchDev mode, and only afterwards bond the up-link representors.

The example below shows the creation of bond interface on two PFs:

Load the bonding device and enslave the up-link representor (currently PF) net-devices.

modprobe bonding mode=802.3ad
ifup bond0    (make sure the ifcfg file is present with the desired bond configuration)
ip link set enp4s0f0 master bond0
ip link set enp4s0f1 master bond0

Add the VF representor net-devices as OVS ports. If tunneling is not used, add the bond device as well.

ovs-vsctl add-port ovs-sriov bond0
ovs-vsctl add-port ovs-sriov enp4s0f0_0
ovs-vsctl add-port ovs-sriov enp4s0f1_0

Make sure to bring up the PF and the representor netdevices.

ip link set dev bond0 up
ip link set dev enp4s0f0_0 up
ip link set dev enp4s0f1_0 up

Warning Once SR-IOV VF LAG is configured, all VFs of the two PFs will become part of the bond, and will behave as described above.





In VF LAG mode, outgoing traffic in load-balanced mode is distributed according to the originating ring: half of the rings are coupled with port 1 and half with port 2. All the traffic on the same ring is sent from the same port.

VF LAG configuration is not supported when the NUM_OF_VFS configured in mlxconfig is higher than 64.

OVS-kernel supports offload of vlan header push/pop actions.

Push - pushing of vlan header is supported on Tx

Pop - popping of tunnel header is supported on Rx

Add a tag=$TAG section for the OVS command line that adds the representor ports. For example, VLAN ID 52 is being used here.

# ovs-vsctl add-port ovs-sriov enp4s0f0
# ovs-vsctl add-port ovs-sriov enp4s0f0_0 tag=52
# ovs-vsctl add-port ovs-sriov enp4s0f0_1 tag=52

The PF port should not have a VLAN attached. This will cause OVS to add VLAN push/pop actions when managing traffic for these VFs.

Dump Flow Example

recirc_id(0),in_port(3),eth(src=e4:11:22:33:44:50,dst=00:02:c9:e9:bb:b2),eth_type(0x0800),ipv4(frag=no), \
packets:0, bytes:0, used:never, actions:push_vlan(vid=52,pcp=0),2
recirc_id(0),in_port(2),eth(src=00:02:c9:e9:bb:b2,dst=e4:11:22:33:44:50),eth_type(0x8100), \
vlan(vid=52,pcp=0),encap(eth_type(0x0800),ipv4(frag=no)), packets:0, bytes:0, used:never, actions:pop_vlan,3

VLAN Offload using TC Rules Example

# tc filter add dev ens4f0_0 protocol ip parent ffff: \
    flower \
    skip_sw \
    dst_mac e4:11:22:11:4a:51 \
    src_mac e4:11:22:11:4a:50 \
    action vlan push id 100 \
    action mirred egress redirect dev ens4f0

# tc filter add dev ens4f0 protocol 802.1Q parent ffff: \
    flower \
    skip_sw \
    dst_mac e4:11:22:11:4a:51 \
    src_mac e4:11:22:11:4a:50 \
    vlan_ethtype 0x800 \
    vlan_id 100 \
    vlan_prio 0 \
    action vlan pop \
    action mirred egress redirect dev ens4f0_0

Warning Port Mirroring is currently supported in ConnectX-5 adapter cards only.

Unlike para-virtual configurations, when the VM traffic is offloaded to the hardware via SR-IOV VF, the host side Admin cannot snoop the traffic (e.g. for monitoring).

ASAP² uses the existing mirroring support in OVS and TC along with the enhancement to the offloading logic in the driver to allow mirroring the VF traffic to another VF.

The mirror VF can be used to run a traffic analyzer (tcpdump, wireshark, etc.) and observe the traffic of the VF being mirrored.

The example below shows the creation of port mirror on the following configuration:

# ovs-vsctl show
09d8a574-9c39-465c-9f16-47d81c12f88a
    Bridge br-vxlan
        Port "enp4s0f0_1"
            Interface "enp4s0f0_1"
        Port "vxlan0"
            Interface "vxlan0"
                type: vxlan
                options: {key="100", remote_ip="192.168.1.14"}
        Port "enp4s0f0_0"
            Interface "enp4s0f0_0"
        Port "enp4s0f0_2"
            Interface "enp4s0f0_2"
        Port br-vxlan
            Interface br-vxlan
                type: internal
    ovs_version: "2.8.90"

To set enp4s0f0_0 as the mirror port and mirror all of the traffic, run:

# ovs-vsctl -- --id=@p get port enp4s0f0_0 \
    -- --id=@m create mirror name=m0 select-all=true output-port=@p \
    -- set bridge br-vxlan mirrors=@m

To set enp4s0f0_0 as the mirror port and mirror only the traffic whose destination is enp4s0f0_1, run:

# ovs-vsctl -- --id=@p1 get port enp4s0f0_0 \
    -- --id=@p2 get port enp4s0f0_1 \
    -- --id=@m create mirror name=m0 select-dst-port=@p2 output-port=@p1 \
    -- set bridge br-vxlan mirrors=@m

To set enp4s0f0_0 as the mirror port and mirror only the traffic whose source is enp4s0f0_1, run:

# ovs-vsctl -- --id=@p1 get port enp4s0f0_0 \
    -- --id=@p2 get port enp4s0f0_1 \
    -- --id=@m create mirror name=m0 select-src-port=@p2 output-port=@p1 \
    -- set bridge br-vxlan mirrors=@m

To set enp4s0f0_0 as the mirror port and mirror all the traffic on enp4s0f0_1 (both directions), run:

# ovs-vsctl -- --id=@p1 get port enp4s0f0_0 \
    -- --id=@p2 get port enp4s0f0_1 \
    -- --id=@m create mirror name=m0 select-dst-port=@p2 select-src-port=@p2 output-port=@p1 \
    -- set bridge br-vxlan mirrors=@m

To clear the mirror port:

# ovs-vsctl clear bridge br-vxlan mirrors

Offloaded flows (including connection tracking) are added to virtual switch FDB flow tables. FDB tables have a set of flow groups. Each flow group saves the same traffic pattern flows. For example, for connection tracking offloaded flow, TCP and UDP are different traffic patterns which end up in two different flow groups.

A flow group has a limited size for saving flow entries. By default, the driver has 4 big FDB flow groups. Each of these big flow groups can save at most 4000000/(4+1)=800k different 5-tuple flow entries. For scenarios with more than 4 traffic patterns, the driver provides a module parameter (num_of_groups) to allow customization and performance tuning.

The size of each big flow group can be calculated according to the following formula.

Warning size = 4000000/(num_of_groups+1)
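The formula can be evaluated for a few num_of_groups values to see how the per-group capacity shrinks as the number of groups grows. The sketch below uses shell integer arithmetic; the function name fdb_group_size is illustrative, and the values 4, 15, and 63 are the per-version defaults listed in this document.

```shell
# Size of each big FDB flow group for a given num_of_groups,
# following size = 4000000/(num_of_groups+1) with integer division.
fdb_group_size() {
    echo $(( 4000000 / ($1 + 1) ))
}

for groups in 4 15 63; do
    echo "num_of_groups=$groups -> $(fdb_group_size "$groups") entries per group"
done
```

This prints 800000, 250000, and 62500 entries per group respectively, which shows the trade-off: more groups accommodate more distinct traffic patterns, but each pattern gets fewer offloaded entries.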

To change the number of big FDB flow groups, run:

$ echo <num_of_groups> > /sys/module/mlx5_core/parameters/num_of_groups

The change takes effect immediately if there is no flow inside the FDB table (no traffic running and all offloaded flows are aged out), and it can be dynamically changed without reloading the driver.

The module parameter can be set statically in /etc/modprobe.d/mlnx.conf file. This way the administrator will not be required to set it via sysfs each time the driver is reloaded.

If there are residual offloaded flows when changing this parameter, then the new configuration only takes effect after all flows age out.

Warning Note that the default value of num_of_groups may change per MLNX_EN driver version. The following table lists the values that must be set when upgrading the MLNX_EN version prior to driver load, in order to achieve the same OOB experience.

MLNX_EN Version     num_of_groups Default Value
v4.7-3.2.9.0        4
v4.6-3.1.9.0.14     15
v4.6-3.1.9.0.15     15
v4.5-1.0.1.0.19     63





This feature allows for monitoring traffic sent between two VMs on the same host using an sFlow collector.

The example below shows the creation of port mirroring on the following configuration.

# ovs-vsctl show
09d8a574-9c39-465c-9f16-47d81c12f88a
    Bridge br-vxlan
        Port "enp4s0f0_1"
            Interface "enp4s0f0_1"
        Port "vxlan0"
            Interface "vxlan0"
                type: vxlan
                options: {key="100", remote_ip="192.168.1.14"}
        Port "enp4s0f0_0"
            Interface "enp4s0f0_0"
        Port "enp4s0f0_2"
            Interface "enp4s0f0_2"
        Port br-vxlan
            Interface br-vxlan
                type: internal
    ovs_version: "2.14.1"

To sample all traffic over the OVS bridge:

# ovs-vsctl -- --id=@sflow create sflow agent=\"$SFLOW_AGENT\" \
    target=\"$SFLOW_TARGET:$SFLOW_PORT\" header=$SFLOW_HEADER \
    sampling=$SFLOW_SAMPLING polling=10 \
    -- set bridge br-vxlan sflow=@sflow

Parameter        Description
SFLOW_AGENT      Indicates that the sFlow agent should send traffic from SFLOW_AGENT’s IP address
SFLOW_TARGET     Remote IP address of the sFlow collector
SFLOW_PORT       Remote port of the sFlow collector
SFLOW_HEADER     Size of packet header to sample (in bytes)
SFLOW_SAMPLING   Sample rate

To clear the sFlow configuration:

# ovs-vsctl clear bridge br-vxlan sflow

sFlow using TC:

Sample to VF:

tc filter add dev $rep parent ffff: protocol arp pref 1 \
    flower \
    dst_mac e4:1d:2d:5d:25:35 \
    src_mac e4:1d:2d:5d:25:34 \
    action sample rate 10 group 5 trunc 96 \
    action mirred egress redirect dev $NIC

Warning A userspace application is needed in order to process the sampled packets from the kernel. Example: https://github.com/Mellanox/libpsample





Support for metering uplink and VF representor traffic has been added.

Traffic going to a representor device can be a result of a miss in the embedded switch (eSwitch) FDB tables. This means that a packet which arrived from that representor into the eSwitch was not matched against the existing rules in the hardware FDB tables, and is therefore forwarded to the originating representor device driver to be handled in software.

The meter allows configuring the max rate [packets/sec] and max burst [packets] for traffic going to the representor driver. Any traffic exceeding the values provided by the user is dropped in hardware. Statistics show the number of dropped packets.

The metering of a representor is configured via a new sysfs entry called miss_rl_cfg.

Full path of the miss_rl_cfg parameter: /sys/class/net//rep_config/miss_rl_cfg

Usage: echo "<rate> <burst>" > /sys/class/net//rep_config/miss_rl_cfg . Rate is the max rate of packets allowed for this representor (in packets/sec units) and burst is the max burst size allowed for this representor (in packets units). Both values must be specified. The default is 0 for both, meaning unlimited rate and burst.

To view the amount of packets and bytes that were dropped due to traffic exceeding the user-provided rate and burst, two read-only sysfs for statistics are exposed.

/sys/class/net//rep_config/miss_rl_dropped_bytes counts how many FDB-miss bytes were dropped due to reaching the miss limits

/sys/class/net//rep_config/miss_rl_dropped_packets counts how many FDB-miss packets were dropped due to reaching the miss limits
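The configuration and the two counters can be wrapped in small helpers. This is a sketch under stated assumptions: it assumes the path component missing from the paths above is the representor's netdev name (passed as the first argument), and the function names and the SYSFS_NET override (for trying the helpers against a scratch directory) are hypothetical.

```shell
# Set the FDB-miss rate limit of a representor.
# Usage: set_miss_rate_limit <representor netdev> <rate in packets/sec> <burst in packets>
# Assumption: the representor netdev name is the path component under /sys/class/net.
set_miss_rate_limit() {
    mrl_rep=$1 mrl_rate=$2 mrl_burst=$3
    # Both values must be written together, separated by a space.
    echo "$mrl_rate $mrl_burst" > "${SYSFS_NET:-/sys/class/net}/$mrl_rep/rep_config/miss_rl_cfg"
}

# Print the drop counters caused by the limit for one representor.
show_miss_drops() {
    mrl_dir="${SYSFS_NET:-/sys/class/net}/$1/rep_config"
    echo "packets=$(cat "$mrl_dir/miss_rl_dropped_packets") bytes=$(cat "$mrl_dir/miss_rl_dropped_bytes")"
}

# Example calls (the representor name is a placeholder):
#   set_miss_rate_limit enp4s0f0_0 1000 100
#   show_miss_drops enp4s0f0_0
#   set_miss_rate_limit enp4s0f0_0 0 0    # back to unlimited rate and burst
```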

To configure OVS-DPDK HW offloads:

Unbind the VFs.

echo 0000:04:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:04:00.3 > /sys/bus/pci/drivers/mlx5_core/unbind

Note: VMs with attached VFs must be powered off to be able to unbind the VFs.

Change the e-switch mode from Legacy to SwitchDev on the PF device (make sure all VFs are unbound). This will also create the VF representor netdevices in the host OS.

echo switchdev > /sys/class/net/enp4s0f0/compat/devlink/mode

To revert to SR-IOV Legacy mode:

echo legacy > /sys/class/net/enp4s0f0/compat/devlink/mode

Note that running this command will also result in the removal of the VF representor netdevices.

Bind the VFs.

echo 0000:04:00.2 > /sys/bus/pci/drivers/mlx5_core/bind
echo 0000:04:00.3 > /sys/bus/pci/drivers/mlx5_core/bind

Run the Open vSwitch service.

systemctl start openvswitch

Enable hardware offload (disabled by default).

ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true

Configure the DPDK white list.

ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-extra="-w 0000:01:00.0,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=1"

Restart the Open vSwitch service. This step is required for HW offload changes to take effect.

systemctl restart openvswitch

Add PF to OVS.

ovs-vsctl add-port br0-ovs pf -- set Interface pf type=dpdk options:dpdk-devargs=0000:88:00.0

Add representor to OVS.

ovs-vsctl add-port br0-ovs representor -- set Interface representor type=dpdk options:dpdk-devargs=0000:88:00.0,representor=[$rep]

vSwitch in userspace rather than kernel-based Open vSwitch requires an additional bridge. The purpose of this bridge is to allow use of the kernel network stack for routing and ARP resolution.

The datapath needs to look-up the routing table and ARP table to prepare the tunnel header and transmit data to the output port.

Warning The configuration is done with: PF on 0000:03:00.0 PCI and MAC 98:03:9b:cc:21:e8

Local IP 56.56.67.1 - br-phy interface will be configured to this IP

Remote IP 56.56.68.1

To configure OVS-DPDK VXLAN:

Create a br-phy bridge.

ovs-vsctl add-br br-phy -- set Bridge br-phy datapath_type=netdev -- br-set-external-id br-phy bridge-id br-phy -- set bridge br-phy fail-mode=standalone other_config:hwaddr=98:03:9b:cc:21:e8

Attach PF interface to br-phy bridge.

ovs-vsctl add-port br-phy p0 -- set Interface p0 type=dpdk options:dpdk-devargs=0000:03:00.0

Configure IP to the bridge.

ip addr add 56.56.67.1/24 dev br-phy

Create a br-ovs bridge.

ovs-vsctl add-br br-ovs -- set Bridge br-ovs datapath_type=netdev -- br-set-external-id br-ovs bridge-id br-ovs -- set bridge br-ovs fail-mode=standalone

Attach representor to br-ovs.

ovs-vsctl add-port br-ovs pf0vf0 -- set Interface pf0vf0 type=dpdk options:dpdk-devargs=0000:03:00.0,representor=[0]

Add a port for the VXLAN tunnel.

ovs-vsctl add-port ovs-sriov vxlan0 -- set interface vxlan0 type=vxlan options:local_ip=56.56.67.1 options:remote_ip=56.56.68.1 options:key=45 options:dst_port=4789

Connection tracking enables stateful packet processing by keeping a record of currently open connections.

OVS flows using connection tracking can be accelerated using advanced Network Interface Cards (NICs) by offloading established connections.

To view offloaded connections, run:

ovs-appctl dpctl/offload-stats-show





To configure OVS-DPDK SR-IOV VF LAG:

Enable SR-IOV on the NICs.

mlxconfig -d <PCI> set SRIOV_EN=1

Allocate the desired number of VFs per port.

echo $n > /sys/class/net/<net name>/device/sriov_numvfs

Unbind all VFs.

echo <VF PCI> > /sys/bus/pci/drivers/mlx5_core/unbind

Change both NICs' mode to SwitchDev.

devlink dev eswitch set pci/<PCI> mode switchdev

Create Linux bonding using kernel modules.

modprobe bonding mode=<desired mode>

Note: Other bonding parameters can be added here. The supported Bond modes are: Active-Backup, XOR and LACP.

Bring all PFs and VFs down.

ip link set <PF/VF> down

Attach both PFs to the bond.

ip link set <PF> master bond0

To work with VF-LAG with OVS-DPDK, add the bond master (PF) to the bridge.

ovs-vsctl add-port br-phy p0 -- set Interface p0 type=dpdk options:dpdk-devargs=0000:03:00.0 options:dpdk-lsc-interrupt=true

Add representor $N of PF0 or PF1 to a bridge.

ovs-vsctl add-port br-phy rep$N -- set Interface rep$N type=dpdk options:dpdk-devargs=<PF0 PCI>,representor=pf0vf$N

OR

ovs-vsctl add-port br-phy rep$N -- set Interface rep$N type=dpdk options:dpdk-devargs=<PF0 PCI>,representor=pf1vf$N

Warning Hardware vDPA is supported on ConnectX-6 Dx, ConnectX-6 Lx & BlueField-2 cards and above only.

Warning Hardware vDPA is enabled by default. In case your hardware does not support vDPA, the driver will fall back to Software vDPA. To check which vDPA mode is activated on your driver, run: ovs-ofctl -O OpenFlow14 dump-ports br0-ovs and look for hw-mode flag.

Warning This feature has not been accepted to the OVS-DPDK Upstream yet, making its API subject to change.

In user space, there are two main approaches for communicating with a guest (VM), either through SR-IOV, or through virtIO.

Phy ports (SR-IOV) allow working with port representor, which is attached to the OVS and a matching VF is given with pass-through to the guest. HW rules can process packets from up-link and direct them to the VF without going through SW (OVS). Therefore, using SR-IOV achieves the best performance.

However, SR-IOV architecture requires the guest to use a driver specific to the underlying HW. Specific HW driver has two main drawbacks:

Breaks virtualization in some sense (guest is aware of the HW). It can also limit the type of images supported.

Gives less natural support for live migration.

Using virtIO port solves both problems. However, it reduces performance and causes loss of some functionalities, such as, for some HW offloads, working directly with virtIO. To solve this conflict, a new netdev type- dpdkvdpa has been created. The new netdev is similar to the regular DPDK netdev, yet introduces several additional functionalities.

dpdkvdpa translates between phy port to virtIO port. It takes packets from the Rx queue and sends them to the suitable Tx queue, and allows transfer of packets from virtIO guest (VM) to a VF, and vice-versa, benefitting from both SR-IOV and virtIO.

To add vDPA port:

ovs-vsctl add-port br0 vdpa0 -- set Interface vdpa0 type=dpdkvdpa \
    options:vdpa-socket-path=<sock path> \
    options:vdpa-accelerator-devargs=<vf pci id> \
    options:dpdk-devargs=<pf pci id>,representor=[id] \
    options:vdpa-max-queues=<num queues> \
    options:vdpa-sw=<true/false>

Note: vdpa-max-queues is an optional field. When the user wants to configure 32 vDPA ports, the maximum queues number is limited to 8.

Prior to configuring vDPA in OVS-DPDK mode, follow the steps below.

Generate the VF.

echo 0 > /sys/class/net/enp175s0f0/device/sriov_numvfs
echo 4 > /sys/class/net/enp175s0f0/device/sriov_numvfs

Unbind each VF.

echo <pci> > /sys/bus/pci/drivers/mlx5_core/unbind

Switch to SwitchDev mode.

echo switchdev >> /sys/class/net/enp175s0f0/compat/devlink/mode

Bind each VF.

echo <pci> > /sys/bus/pci/drivers/mlx5_core/bind

Initialize OVS with:

ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl --no-wait set Open_vSwitch . other_config:hw-offload=true

To configure vDPA in OVS-DPDK mode on ConnectX-5 cards and above:

Open vSwitch configuration.

ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-extra="-w 0000:01:00.0,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=1"
/usr/share/openvswitch/scripts/ovs-ctl restart

Create OVS-DPDK bridge.

ovs-vsctl add-br br0-ovs -- set bridge br0-ovs datapath_type=netdev
ovs-vsctl add-port br0-ovs pf -- set Interface pf type=dpdk options:dpdk-devargs=0000:01:00.0

Create vDPA port as part of the OVS-DPDK bridge.

ovs-vsctl add-port br0-ovs vdpa0 -- set Interface vdpa0 type=dpdkvdpa options:vdpa-socket-path=/var/run/virtio-forwarder/sock0 options:vdpa-accelerator-devargs=0000:01:00.2 options:dpdk-devargs=0000:01:00.0,representor=[0] options:vdpa-max-queues=8

To configure vDPA in OVS-DPDK mode on BlueField cards:

Set the bridge with the software or hardware vDPA port:

On the ARM side:

Create the OVS-DPDK bridge.
    ovs-vsctl add-br br0-ovs -- set bridge br0-ovs datapath_type=netdev
    ovs-vsctl add-port br0-ovs pf -- set Interface pf type=dpdk options:dpdk-devargs=0000:af:00.0
    ovs-vsctl add-port br0-ovs rep -- set Interface rep type=dpdk options:dpdk-devargs=0000:af:00.0,representor=[0]

On the host side:

Create the OVS-DPDK bridge.
    ovs-vsctl add-br br1-ovs -- set bridge br1-ovs datapath_type=netdev protocols=OpenFlow14
    ovs-vsctl add-port br1-ovs vdpa0 -- set Interface vdpa0 type=dpdkvdpa options:vdpa-socket-path=/var/run/virtio-forwarder/sock0 options:vdpa-accelerator-devargs=0000:af:00.2

Note: To configure SW vDPA, add "options:vdpa-sw=true" to the end of the command.

SW vDPA can also be used in configurations where the HW offload is done through TC and not DPDK.

1. Open vSwitch configuration.
    ovs-vsctl set Open_vSwitch . other_config:dpdk-extra="-w 0000:01:00.0,representor=[0],dv_flow_en=1,dv_esw_en=0,dv_xmeta_en=0,isolated_mode=1"
    /usr/share/openvswitch/scripts/ovs-ctl restart
2. Create the OVS-DPDK bridge.
    ovs-vsctl add-br br0-ovs -- set bridge br0-ovs datapath_type=netdev
3. Create the vDPA port as part of the OVS-DPDK bridge.
    ovs-vsctl add-port br0-ovs vdpa0 -- set Interface vdpa0 type=dpdkvdpa options:vdpa-socket-path=/var/run/virtio-forwarder/sock0 options:vdpa-accelerator-devargs=0000:01:00.2 options:dpdk-devargs=0000:01:00.0,representor=[0] options:vdpa-max-queues=8
4. Create the kernel bridge.
    ovs-vsctl add-br br-kernel
5. Add the representors to the kernel bridge.
    ovs-vsctl add-port br-kernel enp1s0f0_0
    ovs-vsctl add-port br-kernel enp1s0f0

Warning This feature is at beta level.

OVS offload rules are based on a multi-table architecture. The E2E cache feature merges the multi-table flow matches and actions into one joint flow.

This improves connection tracking performance by using a single table when an exact match is detected.

To set the E2E cache size (default = 4k):

    ovs-vsctl set open_vswitch . other_config:e2e-size=<size>
    systemctl restart openvswitch

Note: Make sure to restart the openvswitch service in order for the configuration to take effect.

To enable/disable the E2E cache (default = disabled):

    ovs-vsctl set open_vswitch . other_config:e2e-enable=<true/false>
    systemctl restart openvswitch

Note: Make sure to restart the openvswitch service in order for the configuration to take effect.
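Since both settings require a service restart, they can be applied together. The sketch below only prints the commands; it assumes that a single restart picks up both keys, which the text above does not state explicitly. The function name `e2e_cache_cmds` is hypothetical, and any size passed to it is a placeholder because the unit of e2e-size is not given here.

```shell
#!/bin/sh
# Hypothetical helper: prints the commands that set the E2E cache options and
# restart Open vSwitch once. Nothing is executed.
# Args: enable (true|false) [size]
e2e_cache_cmds() {
    enable=$1; size=${2:-}
    # Reject anything other than the two values the option accepts.
    case "$enable" in
        true|false) ;;
        *) echo "enable must be true or false" >&2; return 1 ;;
    esac
    # The size is optional; when omitted, the existing setting is left alone.
    if [ -n "$size" ]; then
        echo "ovs-vsctl set open_vswitch . other_config:e2e-size=$size"
    fi
    echo "ovs-vsctl set open_vswitch . other_config:e2e-enable=$enable"
    echo "systemctl restart openvswitch"
}
```

For example, `e2e_cache_cmds true 8192` prints three lines that can be reviewed and then run as root.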

To dump E2E cache statistics:

    ovs-appctl dpctl/dump-e2e-stats

To dump E2E cache flows:

    ovs-appctl dpctl/dump-e2e-flows





Geneve tunneling offload support includes matching on the extension header.

To configure OVS-DPDK Geneve encap/decap:

1. Create a br-phy bridge.
    ovs-vsctl --may-exist add-br br-phy -- set Bridge br-phy datapath_type=netdev -- br-set-external-id br-phy bridge-id br-phy -- set bridge br-phy fail-mode=standalone
2. Attach the PF interface to the br-phy bridge.
    ovs-vsctl add-port br-phy pf -- set Interface pf type=dpdk options:dpdk-devargs=<PF PCI>
3. Configure an IP address on the bridge.
    ifconfig br-phy <$local_ip_1> up
4. Create a br-int bridge.
    ovs-vsctl --may-exist add-br br-int -- set Bridge br-int datapath_type=netdev -- br-set-external-id br-int bridge-id br-int -- set bridge br-int fail-mode=standalone
5. Attach the representor to br-int.
    ovs-vsctl add-port br-int rep$x -- set Interface rep$x type=dpdk options:dpdk-devargs=<PF PCI>,representor=[$x]
6. Add a port for the GENEVE tunnel.
    ovs-vsctl add-port br-int geneve0 -- set interface geneve0 type=geneve options:key=<VNI> options:remote_ip=<$remote_ip_1> options:local_ip=<$local_ip_1>
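The Geneve steps above can be sketched as one parameterized script that prints the command sequence for review. The function names `bridge_cmd` and `geneve_setup_cmds` and the argument order are hypothetical; the commands themselves are the ones listed above with the placeholders filled in from the arguments.

```shell
#!/bin/sh
# Hypothetical helper: prints the add-br command used for both bridges.
bridge_cmd() {
    echo "ovs-vsctl --may-exist add-br $1 -- set Bridge $1 datapath_type=netdev -- br-set-external-id $1 bridge-id $1 -- set bridge $1 fail-mode=standalone"
}

# Hypothetical helper: prints the Geneve encap/decap setup for one representor,
# in the same order as the documented steps. Nothing is executed.
# Args: pf_pci rep_index vni local_ip remote_ip
geneve_setup_cmds() {
    pf_pci=$1; x=$2; vni=$3; local_ip=$4; remote_ip=$5
    bridge_cmd br-phy
    echo "ovs-vsctl add-port br-phy pf -- set Interface pf type=dpdk options:dpdk-devargs=$pf_pci"
    echo "ifconfig br-phy $local_ip up"
    bridge_cmd br-int
    echo "ovs-vsctl add-port br-int rep$x -- set Interface rep$x type=dpdk options:dpdk-devargs=$pf_pci,representor=[$x]"
    echo "ovs-vsctl add-port br-int geneve0 -- set interface geneve0 type=geneve options:key=$vni options:remote_ip=$remote_ip options:local_ip=$local_ip"
}
```

Calling `geneve_setup_cmds <PF PCI> 0 <VNI> <local ip> <remote ip>` prints six commands, one per step.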

OVS-DPDK supports parallel insertion and deletion of offloads (flow & CT). While multiple threads are supported, by default only one is used.

To configure multiple threads:

    ovs-vsctl set Open_vSwitch . other_config:hw-offload-thread-nb=3

Make sure to restart the openvswitch service in order for the configuration to take effect.

    systemctl restart openvswitch