Open vSwitch (OVS) allows Virtual Machines (VMs) to communicate with each other and with the outside world. OVS traditionally resides in the hypervisor, and switching is based on twelve-tuple matching on flows. The OVS software-based solution is CPU intensive, affecting system performance and preventing full utilization of the available bandwidth. The currently supported OVS is OVS running in the Linux kernel.
Mellanox Accelerated Switching And Packet Processing (ASAP2) technology allows OVS offloading by handling the OVS data-plane in ConnectX-5 and above NIC hardware (Mellanox Embedded Switch or eSwitch), while keeping the OVS control-plane unmodified. As a result, we observe significantly higher OVS performance without the associated CPU load.
The actions currently supported by ASAP2 include packet parsing and matching, forward, and drop, along with VLAN push/pop or VXLAN encapsulation/decapsulation.
Installing ASAP2 Packages
Install the required packages. For the complete solution, you need to install supporting MLNX_EN (v4.4 and above), iproute2, and openvswitch packages.
Setting Up SR-IOV
To set up SR-IOV:
Choose the desired card.
The example below shows a dual-ported ConnectX-5 card (device ID 0x1017) and a single SR-IOV VF (Virtual Function, device ID 0x1018).
In SR-IOV terms, the card itself is referred to as the PF (Physical Function).
Enabling SR-IOV and creating VFs is done by the firmware upon admin directive as explained in Step 5 below.
Identify the Mellanox NICs and locate net-devices which are on the NIC PCI BDF.
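For example, the Mellanox NICs and the net-devices on a given PCI BDF can be located as follows (the BDF 0000:04:00.0 is illustrative):

```shell
# List Mellanox PCI devices (ConnectX-5 PF device ID 0x1017, VF 0x1018)
lspci -nn | grep Mellanox

# Find the net-device that sits on a given PCI BDF
ls /sys/bus/pci/devices/0000:04:00.0/net
```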
The PF NIC for port #1 is enp4s0f0, and the rest of the commands will be issued on it.
Check the firmware version.
Make sure the firmware versions installed are as stated in the Release Notes document.
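The firmware version of the PF netdevice can be queried, for example, with ethtool:

```shell
# Query the driver and firmware version of the PF
ethtool -i enp4s0f0 | grep firmware
```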
Make sure SR-IOV is enabled on the system (server, card).
Make sure SR-IOV is enabled by the server BIOS, and by the firmware with up to N VFs, where N is the number of VFs required for your environment. Refer to "Mellanox Firmware Tools" below for more details.
Turn ON SR-IOV on the PF device.
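A common way to turn on SR-IOV and create the VFs is through sysfs; here we create 2 VFs (the VF count is an example):

```shell
# Create 2 VFs on the PF
echo 2 > /sys/class/net/enp4s0f0/device/sriov_numvfs
```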
Provision the VF MAC addresses using the IP tool.
Verify the VF MAC addresses were provisioned correctly and SR-IOV was turned ON.
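A sketch of provisioning a VF MAC address with the ip tool and verifying the result (the MAC and VF index are from the example):

```shell
# Assign a MAC address to VF 0 of the PF
ip link set enp4s0f0 vf 0 mac e4:11:22:33:44:50

# Verify the VF MAC addresses and the number of enabled/supported VFs
ip link show enp4s0f0
cat /sys/class/net/enp4s0f0/device/sriov_numvfs
cat /sys/class/net/enp4s0f0/device/sriov_totalvfs
```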
In the example above, the maximum number of possible VFs supported by the firmware is 4 and only 2 are enabled.
Provision the PCI VF devices to VMs using PCI Pass-Through or any other preferred virtualization tool of choice, e.g., virt-manager.
For further information on SR-IOV, refer to https://support.mellanox.com/docs/DOC-2386.
Configuring Open-vSwitch (OVS) Offload
Unbind the VFs.
VMs with attached VFs must be powered off to be able to unbind the VFs.
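The VFs can be unbound from the mlx5_core driver through sysfs; the VF PCI address below is illustrative:

```shell
# Unbind VF 0000:04:00.2 from the mlx5_core driver
echo 0000:04:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind
```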
Change the e-switch mode from legacy to switchdev on the PF device.
This will also create the VF representor netdevices in the host OS.
Before changing the mode, make sure that all VFs are unbound.
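Using the same compat/devlink interface shown for legacy mode, a sketch of switching the PF to switchdev mode:

```shell
# Change the e-switch mode of the PF from legacy to switchdev
echo switchdev > /sys/class/net/enp4s0f0/compat/devlink/mode

# On recent kernels this can also be done with devlink, e.g.:
# devlink dev eswitch set pci/0000:04:00.0 mode switchdev
```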
To go back to SR-IOV legacy mode:
# echo legacy > /sys/class/net/enp4s0f0/compat/devlink/mode
Running this command, will also remove the VF representor netdevices.
Set the network VF representor device names to be in the form of <PF netdevname>_<VF ID=0,1,[..]>, either by:
* Using a udev rule in /etc/udev/rules.d/:
Replace the phys_switch_id value ("e41d2d60971d" above) with the value matching your switch, as obtained from:
Example output of device names when using the udev rule:
* Using the supplied vf-net-link-name.sh script to set the VF representor device names. From the scripts directory, copy vf-net-link-name.sh to /etc/udev/ and 82-net-setup-link.rules to /etc/udev/rules.d/.
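The udev rule of the first method can be sketched as follows; the rule body is illustrative, and the phys_switch_id value ("e41d2d60971d") is the one from the example, which must match your switch:

```shell
# Content of /etc/udev/rules.d/82-net-setup-link.rules (illustrative sketch):
# SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="e41d2d60971d", \
#     ATTR{phys_port_name}!="", NAME="enp4s0f0_$attr{phys_port_name}"

# The phys_switch_id of your switch can be read from sysfs:
cat /sys/class/net/enp4s0f0/phys_switch_id
```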
Run the openvswitch service.
Create an OVS bridge (here it's named ovs-sriov).
Enable hardware offload (disabled by default).
Restart the openvswitch service. This step is required for HW offload changes to take effect.
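The steps above can be sketched as follows (the bridge name matches the example):

```shell
# Start Open vSwitch
systemctl start openvswitch

# Create an OVS bridge named ovs-sriov
ovs-vsctl add-br ovs-sriov

# Enable hardware offload (disabled by default) and restart the service
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
systemctl restart openvswitch
```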
HW offload policy can also be changed by setting the tc-policy using one of the following values:
* none - adds a TC rule to both the software and the hardware (default)
* skip_sw - adds a TC rule only to the hardware
* skip_hw - adds a TC rule only to the software
This setting is intended for debugging purposes.
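For example, to add TC rules only to the hardware:

```shell
# Set tc-policy to skip_sw so rules are added to hardware only (debug)
ovs-vsctl set Open_vSwitch . other_config:tc-policy=skip_sw
```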
Add the PF and the VF representor netdevices as OVS ports.
Make sure to bring up the PF and representor netdevices.
The PF represents the uplink (wire).
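A sketch of adding the PF and a VF representor as OVS ports and bringing them up (the representor name follows the naming scheme above):

```shell
# Add the PF (uplink) and the VF representor to the bridge
ovs-vsctl add-port ovs-sriov enp4s0f0
ovs-vsctl add-port ovs-sriov enp4s0f0_0

# Bring the netdevices up
ip link set dev enp4s0f0 up
ip link set dev enp4s0f0_0 up
```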
Run traffic from the VFs and observe the rules added to the OVS data-path.
In the example above, the ping was initiated from VF0 (OVS port 3) to the outer node (OVS port 2), where the VF MAC is e4:11:22:33:44:50 and the outer node MAC is e4:1d:2d:a5:f3:9d.
As shown above, two OVS rules were added, one in each direction.
Note that you can also verify offloaded packets by adding type=offloaded to the command. For example:
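The data-path rules, and the offloaded subset, can be dumped with:

```shell
# Dump all data-path flows
ovs-appctl dpctl/dump-flows

# Dump only the flows that were offloaded to hardware
ovs-appctl dpctl/dump-flows type=offloaded
```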
Flow Statistics and Aging
The aging timeout of OVS is given in ms and can be controlled with this command:
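For example, to set the aging timeout to 30 seconds (30000 ms):

```shell
# max-idle is given in milliseconds
ovs-vsctl set Open_vSwitch . other_config:max-idle=30000
```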
It is common to require the VM traffic to be tagged by the OVS, such that the OVS adds tags (VLAN push) to the packets sent by the VMs and strips the tags (VLAN pop) from the packets received for these VMs from other nodes/VMs.
To do so, add a tag=$TAG section to the OVS command line that adds the representor ports. For example, here we use VLAN ID 52.
The PF port should not have a VLAN attached; OVS will then add VLAN push/pop actions when managing traffic for these VFs.
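For example, adding a representor port with VLAN tag 52:

```shell
# Traffic from/to this VF will be VLAN-tagged (52) by OVS
ovs-vsctl add-port ovs-sriov enp4s0f0_0 tag=52
```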
To see how the OVS rules look with VLANs, here we initiated a ping from VF0 (OVS port 3), whose MAC is e4:11:22:33:44:50, to an outer node (OVS port 2).
At this stage, we can see that two OVS rules were added, one in each direction.
For outgoing traffic (in port = 3), the actions are push VLAN (52) and forward to port 2.
For incoming traffic (in port = 2), matching is also done on the VLAN, and the actions are pop VLAN and forward to port 3.
Offloading VXLAN Encapsulation/Decapsulation Actions
VXLAN encapsulation/decapsulation offloading of OVS actions is supported only in ConnectX-5 adapter cards.
In case of offloading VXLAN, the PF should not be added as a port in the OVS data-path. Instead, it should be assigned the IP address to be used for encapsulation.
The example below shows two hosts (PFs) with IPs 18.104.22.168 and 22.214.171.124, where the PF device on both hosts is enp4s0f0 and the VXLAN tunnel is set with VNID 98:
On the first host:
On the second host:
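A sketch of the VXLAN setup, with $LOCAL_IP and $REMOTE_IP standing for the host's own tunnel IP and the peer's tunnel IP (swap them on the second host):

```shell
# Assign the tunnel IP to the PF instead of adding the PF to the bridge
ip addr add dev enp4s0f0 $LOCAL_IP/24

# Create the VXLAN tunnel port with VNID (key) 98
ovs-vsctl add-port ovs-sriov vxlan0 -- set interface vxlan0 type=vxlan \
    options:local_ip=$LOCAL_IP options:remote_ip=$REMOTE_IP \
    options:key=98 options:dst_port=4789
```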
When encapsulating guest traffic, the VF's device MTU must be reduced to allow the host/HW to add the encapsulation headers without fragmenting the resulting packet. As such, the VF's MTU must be lowered to 1450 for IPv4 and 1430 for IPv6.
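For example, for IPv4 guest traffic, where $VF_NETDEV stands for the VF netdevice inside the VM:

```shell
# Lower the VF MTU to leave room for the VXLAN headers
ip link set dev $VF_NETDEV mtu 1450
```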
To see how the OVS rules look with VXLAN encap/decap actions, here we initiated a ping from a VM on the first host, whose MAC is e4:11:22:33:44:50, to a VM on the second host.
At this stage, we see that two OVS rules were added to the first host, one in each direction.
For outgoing traffic (in port = 3), the actions are set VXLAN tunnel to host 22.214.171.124 (encap) and forward to port 2.
For incoming traffic (in port = 2), matching is also done on the VXLAN tunnel info which was decapsulated, and the action is forward to port 3.
Manually Adding TC Rules
Offloading rules can also be added directly, and not just through OVS, using the tc utility.
To enable TC ingress on both the PF and the VF:
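For example, assuming the PF enp4s0f0 and the VF representor enp4s0f0_0:

```shell
# Enable the TC ingress qdisc on the PF and on the VF representor
tc qdisc add dev enp4s0f0 ingress
tc qdisc add dev enp4s0f0_0 ingress
```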
Bond rules can be added in one of the following methods:
Using shared block (requires kernel support):
Add drop rule:
Add redirect rule from bond to representor:
Add redirect rule from representor to bond:
Without using shared block:
Add redirect rule from bond to representor:
Add redirect rule from representor to bond:
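The bond rules above can be sketched with the tc flower classifier, assuming a bond device bond0, a representor enp4s0f0_0, and shared block index 22:

```shell
# --- With a shared block (requires kernel support) ---
# Attach the bond ingress to shared block 22
tc qdisc add dev bond0 ingress_block 22 ingress

# Drop rule on the shared block
tc filter add block 22 protocol ip parent ffff: prio 1 flower action drop

# Redirect rule from bond to representor
tc filter add block 22 protocol ip parent ffff: prio 2 flower \
    action mirred egress redirect dev enp4s0f0_0

# Redirect rule from representor to bond
tc filter add dev enp4s0f0_0 protocol ip parent ffff: prio 3 flower \
    action mirred egress redirect dev bond0

# --- Without a shared block ---
tc filter add dev bond0 protocol ip parent ffff: prio 1 flower \
    action mirred egress redirect dev enp4s0f0_0
tc filter add dev enp4s0f0_0 protocol ip parent ffff: prio 1 flower \
    action mirred egress redirect dev bond0
```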
VLAN Modify rules can be added in one of the following methods:
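As an illustrative sketch, a VLAN modify rule can be expressed with the TC flower vlan action (the device names and VLAN IDs are assumptions):

```shell
# Rewrite VLAN ID 10 to 11 on ingress of the representor, then redirect to the uplink
tc filter add dev enp4s0f0_0 protocol 802.1Q parent ffff: prio 1 \
    flower vlan_id 10 \
    action vlan modify id 11 pipe \
    action mirred egress redirect dev enp4s0f0
```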
SR-IOV VF LAG
SR-IOV VF LAG allows the NIC's physical functions to get the rules that OVS will try to offload to the bond net-device, and to offload them to the hardware e-switch. The supported bond modes are:
To enable SR-IOV LAG, both physical functions of the NIC should first be configured to SR-IOV switchdev mode, and only afterwards bond the up-link representors.
The example below shows the creation of bond interface on two PFs:
Load the bonding device and enslave the up-link representor (currently the PF) net-devices.
Add the VF representor net-devices as OVS ports. If tunneling is not used, add the bond device as well.
Make sure to bring up the PF and the representor netdevices.
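A sketch of creating the bond and enslaving the two up-link (PF) netdevices enp4s0f0/enp4s0f1, assuming switchdev mode was already set on both:

```shell
# Create the bond (pick the bond mode your setup requires)
ip link add bond0 type bond mode active-backup

# Enslave both up-link (PF) netdevices
ip link set enp4s0f0 down
ip link set enp4s0f1 down
ip link set enp4s0f0 master bond0
ip link set enp4s0f1 master bond0

# Bring everything up
ip link set enp4s0f0 up
ip link set enp4s0f1 up
ip link set bond0 up
```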
Port Mirroring (Flow Based VF Traffic Mirroring for ASAP²)
Port Mirroring is currently supported in ConnectX-5 adapter cards only.
Unlike para-virtual configurations, when the VM traffic is offloaded to the hardware via an SR-IOV VF, the host-side admin cannot snoop the traffic (e.g. for monitoring).
ASAP² uses the existing mirroring support in OVS and TC along with the enhancement to the offloading logic in the driver to allow mirroring the VF traffic to another VF.
The mirrored VF can be used to run a traffic analyzer (tcpdump, Wireshark, etc.) and observe the traffic of the VF being mirrored.
The example below shows the creation of port mirror on the following configuration:
If we want to set enp4s0f0_0 as the mirror port and mirror all of the traffic, set it as follows:
If we want to set enp4s0f0_0 as the mirror port and only mirror traffic whose destination is enp4s0f0_1, set it as follows:
If we want to set enp4s0f0_0 as the mirror port and only mirror traffic whose source is enp4s0f0_1, set it as follows:
If we want to set enp4s0f0_0 as the mirror port and mirror all the traffic on enp4s0f0_1, set it as follows:
To clear the mirror port:
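A sketch of the first case (mirror all bridge traffic to enp4s0f0_0) and of clearing the mirror, using the standard OVS mirror configuration:

```shell
# Mirror all bridge traffic to the enp4s0f0_0 representor
ovs-vsctl -- --id=@p get port enp4s0f0_0 \
    -- --id=@m create mirror name=m0 select-all=true output-port=@p \
    -- set bridge ovs-sriov mirrors=@m

# Clear the mirror port
ovs-vsctl clear bridge ovs-sriov mirrors
```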
Performance Tuning Based on Traffic Patterns
Offloaded flows (including connection tracking) are added to virtual switch FDB flow tables. FDB tables have a set of flow groups. Each flow group saves the same traffic pattern flows. For example, for connection tracking offloaded flow, TCP and UDP are different traffic patterns which end up in two different flow groups.
A flow group has a limited size to save flow entries. By default, the driver has 4 big FDB flow groups. Each of these big flow groups can save at most 4000000/(4+1)=800k different 5-tuple flow entries. For scenarios with more than 4 traffic patterns, the driver provides a module parameter (num_of_groups) to allow customization and performance tuning.
The size of each big flow group can be calculated according to the following formula.
size = 4000000/(num_of_groups+1)
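For example, the per-group capacity for the default of 4 groups and for 8 groups can be computed directly:

```shell
# Default: 4 big flow groups -> 800000 entries per group
echo $((4000000/(4+1)))

# With num_of_groups=8 -> 444444 entries per group
echo $((4000000/(8+1)))
```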
To change the number of big FDB flow groups, run:
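The parameter is exposed through the mlx5_core module; a sketch, where the value 8 is an example:

```shell
# Set the number of big FDB flow groups to 8
echo 8 > /sys/module/mlx5_core/parameters/num_of_groups
```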
The change takes effect immediately if there is no flow inside the FDB table (no traffic running and all offloaded flows are aged out), and it can be dynamically changed without reloading the driver.
The module parameter can be set statically in /etc/modprobe.d/mlnx.conf file. This way the administrator will not be required to set it via sysfs each time the driver is reloaded.
If there are residual offloaded flows when changing this parameter, then the new configuration only takes effect after all flows age out.
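For example, a persistent setting in /etc/modprobe.d/mlnx.conf could look like this (the value 8 is an example):

```shell
# Content of /etc/modprobe.d/mlnx.conf
options mlx5_core num_of_groups=8
```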
Appendix: Mellanox Firmware Tools
Download and install the MFT package corresponding to your computer's operating system. You will need the kernel-devel or kernel-headers RPM before the tools are built and installed.
The package is available at http://www.mellanox.com => Products => Software => Firmware Tools.
Start the mst driver.
Show the devices status.
Make sure your configuration is as follows:
* SR-IOV is enabled (SRIOV_EN=1)
* The number of enabled VFs is enough for your environment (NUM_OF_VFS=N)
* The port’s link type is Ethernet (LINK_TYPE_P1/2=2) when applicable
If this is not the case, use mlxconfig to enable that, as follows:
Set the number of required VFs.
Set the link type to Ethernet.
Reset the firmware.
Query the firmware to make sure everything is set correctly.
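The MFT steps above can be sketched as follows; the mst device name /dev/mst/mt4121_pciconf0 is illustrative and depends on your adapter:

```shell
# Start the mst driver and show the devices status
mst start
mst status

# Query the current configuration
mlxconfig -d /dev/mst/mt4121_pciconf0 q

# Enable SR-IOV, set the number of VFs, and set the link type to Ethernet
mlxconfig -d /dev/mst/mt4121_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=4
mlxconfig -d /dev/mst/mt4121_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2

# Reset the firmware so the new configuration takes effect
mlxfwreset -d /dev/mst/mt4121_pciconf0 reset

# Query again to make sure everything is set correctly
mlxconfig -d /dev/mst/mt4121_pciconf0 q
```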