OVS Offload Using ASAP² Direct
Open vSwitch (OVS) allows Virtual Machines (VMs) to communicate with each other and with the outside world. OVS traditionally resides in the hypervisor, and switching is based on twelve-tuple matching on flows. The OVS software-based solution is CPU intensive, affecting system performance and preventing full utilization of the available bandwidth.
Mellanox Accelerated Switching And Packet Processing (ASAP2) technology enables OVS offloading by handling the OVS data-plane in ConnectX-5 and later NIC hardware (Mellanox Embedded Switch, or eSwitch), while leaving the OVS control-plane unmodified. As a result, OVS performance is significantly higher without the associated CPU load.
As of MLNX_OFED v5.0, OVS-DPDK is part of the MLNX_OFED package as well. OVS-DPDK supports ASAP2 just as OVS-Kernel (the Traffic Control (TC) kernel-based solution) does, yet with a different set of features.
The traditional ASAP2 hardware data plane is built over SR-IOV virtual functions (VFs), so that the VF is passed through directly to the VM, with the Mellanox driver running within the VM. An alternative approach that is also supported is vDPA (vhost Data Path Acceleration). vDPA allows the connection to the VM to be established using VirtIO, so that the data-plane is built between the SR-IOV VF and the standard VirtIO driver within the VM, while the control-plane is managed on the host by the vDPA application. Two flavors of vDPA are supported: Software vDPA and Hardware vDPA. Software vDPA management functionality is embedded into OVS-DPDK, while Hardware vDPA uses a standalone application for management, and can be run with both OVS-Kernel and OVS-DPDK. For further information, please see sections VirtIO Acceleration through VF Relay (Software vDPA) and VirtIO Acceleration through Hardware vDPA.
Install the required packages. For the complete solution, you need to install supporting MLNX_OFED (v4.4 and above), iproute2, and openvswitch packages.
Run:
./mlnxofedinstall --ovs-dpdk --upstream-libs
Note that this section applies to both OVS-DPDK and OVS-Kernel similarly.
To set up SR-IOV:
Choose the desired card.
The example below shows a dual-ported ConnectX-5 card (device ID 0x1017) and a single SR-IOV VF (Virtual Function, device ID 0x1018).
In SR-IOV terms, the card itself is referred to as the PF (Physical Function).
# lspci -nn | grep Mellanox
0a:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
0a:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
0a:00.2 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function] [15b3:1018]
Warning: Enabling SR-IOV and creating VFs is done by the firmware upon admin directive, as explained in Step 5 below.
Identify the Mellanox NICs and locate net-devices which are on the NIC PCI BDF.
# ls -l /sys/class/net/ | grep 04:00
lrwxrwxrwx 1 root root 0 Mar 27 16:58 enp4s0f0 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.0/net/enp4s0f0
lrwxrwxrwx 1 root root 0 Mar 27 16:58 enp4s0f1 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.1/net/enp4s0f1
lrwxrwxrwx 1 root root 0 Mar 27 16:58 eth0 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.2/net/eth0
lrwxrwxrwx 1 root root 0 Mar 27 16:58 eth1 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.3/net/eth1
The PF NIC for port #1 is enp4s0f0, and the rest of the commands will be issued on it.
Check the firmware version.
Make sure the firmware versions installed are as stated in the Release Notes document.
# ethtool -i enp4s0f0 | head -5
driver: mlx5_core
version: 5.0-5
firmware-version: 16.21.0338
expansion-rom-version:
bus-info: 0000:04:00.0
Make sure SR-IOV is enabled on the system (server, card).
Make sure SR-IOV is enabled by the server BIOS, and by the firmware with up to N VFs, where N is the number of VFs required for your environment. Refer to "Mellanox Firmware Tools" below for more details.
# cat /sys/class/net/enp4s0f0/device/sriov_totalvfs
4
Turn ON SR-IOV on the PF device.
# echo 2 > /sys/class/net/enp4s0f0/device/sriov_numvfs
Provision the VF MAC addresses using the IP tool.
# ip link set enp4s0f0 vf 0 mac e4:11:22:33:44:50
# ip link set enp4s0f0 vf 1 mac e4:11:22:33:44:51
Verify the VF MAC addresses were provisioned correctly and SR-IOV was turned ON.
# cat /sys/class/net/enp4s0f0/device/sriov_numvfs
2
# ip link show dev enp4s0f0
256: enp4s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
    link/ether e4:1d:2d:60:95:a0 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC e4:11:22:33:44:50, spoof checking off, link-state auto
    vf 1 MAC e4:11:22:33:44:51, spoof checking off, link-state auto
In the example above, the maximum number of possible VFs supported by the firmware is 4 and only 2 are enabled.
Provision the PCI VF devices to VMs using PCI pass-through or any other preferred virtualization tool, e.g. virt-manager.
For further information on SR-IOV, refer to https://support.mellanox.com/docs/DOC-2386.
OVS-Kernel Hardware Offloads
Configuring Uplink Representor Mode
Please note that this step is optional. However, if you wish to configure uplink representor mode, make sure this step is performed before configuring SwitchDev.
The following are the uplink representor modes available for configuration:
new_netdev: default mode - when in this mode, the uplink representor is created as a new netdevice
nic_netdev: when in this mode, the NIC netdevice acts as the uplink representor device
Example:
echo nic_netdev > /sys/class/net/ens1f0/compat/devlink/uplink_rep_mode
Notes:
The mode can only be changed while the device is in Legacy mode
The mode is not saved when reloading mlx5_core
When two PFs in the same bonding device need to enter SwitchDev mode, the uplink representor mode of both PFs should be the same (either nic_netdev or new_netdev)
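The currently configured mode can be checked by reading back the same sysfs attribute used above (assuming the attribute is readable on your driver version):
cat /sys/class/net/ens1f0/compat/devlink/uplink_rep_mode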
Configuring SwitchDev
Unbind the VFs.
echo 0000:04:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:04:00.3 > /sys/bus/pci/drivers/mlx5_core/unbind
Warning: VMs with attached VFs must be powered off to be able to unbind the VFs.
Change the e-switch mode from legacy to switchdev on the PF device.
This will also create the VF representor netdevices in the host OS.
# echo switchdev > /sys/class/net/enp4s0f0/compat/devlink/mode
Warning: Before changing the mode, make sure that all VFs are unbound.
Warning: To go back to SR-IOV legacy mode:
# echo legacy > /sys/class/net/enp4s0f0/compat/devlink/mode
Running this command will also remove the VF representor netdevices.
Set the network VF representor device names to be in the form of $PF_$VFID, where $PF is the PF netdev name and $VFID is the VF ID=0,1,[..], either by:
* Using this rule in /etc/udev/rules.d/82-net-setup-link.rules:
SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="e41d2d60971d", \
ATTR{phys_port_name}!="", NAME="enp4s0f1_$attr{phys_port_name}"
Replace the phys_switch_id value ("e41d2d60971d" above) with the value matching your switch, as obtained from:
ip -d link show enp4s0f1
Example output of device names when using the udev rule:
ls -l /sys/class/net/ens4*
lrwxrwxrwx 1 root root 0 Mar 27 17:14 enp4s0f0 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.0/net/enp4s0f0
lrwxrwxrwx 1 root root 0 Mar 27 17:15 enp4s0f0_0 -> ../../devices/virtual/net/enp4s0f0_0
lrwxrwxrwx 1 root root 0 Mar 27 17:15 enp4s0f0_1 -> ../../devices/virtual/net/enp4s0f0_1
* Using the supplied 82-net-setup-link.rules and vf-net-link-name.sh script to set the VF representor device names.
From the scripts directory copy vf-net-link-name.sh to /etc/udev/ and 82-net-setup-link.rules to /etc/udev/rules.d/.
Make sure vf-net-link-name.sh is executable.
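For example, assuming the script was copied to /etc/udev/ as described above:
# chmod +x /etc/udev/vf-net-link-name.sh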
Run the openvswitch service.
# systemctl start openvswitch
Create an OVS bridge (here it's named ovs-sriov).
# ovs-vsctl add-br ovs-sriov
Enable hardware offload (disabled by default).
# ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
Restart the openvswitch service. This step is required for HW offload changes to take effect.
# systemctl restart openvswitch
Warning: HW offload policy can also be changed by setting the tc-policy using one of the following values:
* none - adds a TC rule to both the software and the hardware (default)
* skip_sw - adds a TC rule only to the hardware
* skip_hw - adds a TC rule only to the software
The above change is used for debug purposes.
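For example, to install rules only in hardware while debugging (a sketch assuming the tc-policy other_config key described above; revert the value to none when done and restart openvswitch for the change to take effect):
# ovs-vsctl set Open_vSwitch . other_config:tc-policy=skip_sw
# systemctl restart openvswitch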
Add the PF and the VF representor netdevices as OVS ports.
# ovs-vsctl add-port ovs-sriov enp4s0f0
# ovs-vsctl add-port ovs-sriov enp4s0f0_0
# ovs-vsctl add-port ovs-sriov enp4s0f0_1
Make sure to bring up the PF and representor netdevices.
# ip link set dev enp4s0f0 up
# ip link set dev enp4s0f0_0 up
# ip link set dev enp4s0f0_1 up
The PF represents the uplink (wire).
# ovs-dpctl show
system@ovs-system:
    lookups: hit:0 missed:192 lost:1
    flows: 2
    masks: hit:384 total:2 hit/pkt:2.00
    port 0: ovs-system (internal)
    port 1: ovs-sriov (internal)
    port 2: enp4s0f0
    port 3: enp4s0f0_0
    port 4: enp4s0f0_1
Run traffic from the VFs and observe the rules added to the OVS data-path.
# ovs-dpctl dump-flows
recirc_id(0),in_port(3),eth(src=e4:11:22:33:44:50,dst=e4:1d:2d:a5:f3:9d), eth_type(0x0800),ipv4(frag=no), packets:33, bytes:3234, used:1.196s, actions:2
recirc_id(0),in_port(2),eth(src=e4:1d:2d:a5:f3:9d,dst=e4:11:22:33:44:50), eth_type(0x0800),ipv4(frag=no), packets:34, bytes:3332, used:1.196s, actions:3
In the example above, the ping was initiated from VF0 (OVS port 3) to the outer node (OVS port 2), where the VF MAC is e4:11:22:33:44:50 and the outer node MAC is e4:1d:2d:a5:f3:9d
As shown above, two OVS rules were added, one in each direction.
Note that you can also verify offloaded packets by adding type=offloaded to the command. For example:
# ovs-dpctl dump-flows type=offloaded
Flow Statistics and Aging
The aging timeout of OVS is given in ms and can be controlled with this command:
# ovs-vsctl set Open_vSwitch . other_config:max-idle=30000
Offloading VLANs
It is common to require the VM traffic to be tagged by the OVS: the OVS adds tags (VLAN push) to the packets sent by the VMs, and strips (VLAN pop) the packets received for these VMs from other nodes/VMs.
To do so, add a tag=$TAG section to the OVS command line that adds the representor ports. In the example below, VLAN ID 52 is used.
# ovs-vsctl add-port ovs-sriov enp4s0f0
# ovs-vsctl add-port ovs-sriov enp4s0f0_0 tag=52
# ovs-vsctl add-port ovs-sriov enp4s0f0_1 tag=52
The PF port should not have a VLAN attached. This will cause OVS to add VLAN push/pop actions when managing traffic for these VFs.
To see how the OVS rules look with vlans, here we initiated a ping from VF0 (OVS port 3) to an outer node (OVS port 2), where the VF MAC is e4:11:22:33:44:50 and the outer node MAC is 00:02:c9:e9:bb:b2.
At this stage, we can see that two OVS rules were added, one in each direction.
recirc_id(0),in_port(3),eth(src=e4:11:22:33:44:50,dst=00:02:c9:e9:bb:b2),eth_type(0x0800),ipv4(frag=no), \
    packets:0, bytes:0, used:never, actions:push_vlan(vid=52,pcp=0),2
recirc_id(0),in_port(2),eth(src=00:02:c9:e9:bb:b2,dst=e4:11:22:33:44:50),eth_type(0x8100), \
    vlan(vid=52,pcp=0),encap(eth_type(0x0800),ipv4(frag=no)), packets:0, bytes:0, used:never, actions:pop_vlan,3
For outgoing traffic (in port = 3), the actions are push vlan (52) and forward to port 2
For incoming traffic (in port = 2), matching is done also on vlan, and the actions are pop vlan and forward to port 3
Offloading VXLAN Encapsulation/Decapsulation Actions
VXLAN encapsulation / decapsulation offloading of OVS actions is supported only in ConnectX-5 adapter cards.
In case of offloading VXLAN, the PF should not be added as a port in the OVS data-path but rather be assigned with the IP address to be used for encapsulation.
The example below shows two hosts (PFs) with IPs 1.1.1.177 and 1.1.1.75, where the PF device on both hosts is enp4s0f0 and the VXLAN tunnel is set with VNID 98:
On the first host:
# ip addr add 1.1.1.177/24 dev enp4s0f1
# ovs-vsctl add-port ovs-sriov vxlan0 -- set interface vxlan0 type=vxlan options:local_ip=1.1.1.177 options:remote_ip=1.1.1.75 options:key=98
On the second host:
# ip addr add 1.1.1.75/24 dev enp4s0f1
# ovs-vsctl add-port ovs-sriov vxlan0 -- set interface vxlan0 type=vxlan options:local_ip=1.1.1.75 options:remote_ip=1.1.1.177 options:key=98
When encapsulating guest traffic, the VF's device MTU must be reduced to allow the host/HW to add the encapsulation headers without fragmenting the resulting packet. As such, the VF's MTU must be lowered to 1450 for IPv4 and 1430 for IPv6.
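For example, assuming the VF netdevice inside the VM is named eth0 (the name is illustrative):
# ip link set dev eth0 mtu 1450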
To see how the OVS rules look with vxlan encap/decap actions, here we initiated a ping from a VM on the 1st host whose MAC is e4:11:22:33:44:50 to a VM on the 2nd host whose MAC is 46:ac:d1:f1:4c:af
At this stage we see that two OVS rules were added to the first host; one in each direction.
# ovs-dpctl show
system@ovs-system:
    lookups: hit:7869 missed:241 lost:2
    flows: 2
    masks: hit:13726 total:10 hit/pkt:1.69
    port 0: ovs-system (internal)
    port 1: ovs-sriov (internal)
    port 2: vxlan_sys_4789 (vxlan)
    port 3: enp4s0f1_0
    port 4: enp4s0f1_1
# ovs-dpctl dump-flows
recirc_id(0),in_port(3),eth(src=e4:11:22:33:44:50,dst=46:ac:d1:f1:4c:af),eth_type(0x0800),ipv4(tos=0/0x3,frag=no), packets:4, bytes:392, used:0.664s, actions:set(tunnel(tun_id=0x62,dst=1.1.1.75,ttl=64,flags(df,key))),2
recirc_id(0),tunnel(tun_id=0x62,src=1.1.1.75,dst=1.1.1.177,ttl=64,flags(-df-csum+key)),in_port(2),skb_mark(0),eth(src=46:ac:d1:f1:4c:af,dst=e4:11:22:33:44:50),eth_type(0x0800),ipv4(frag=no), packets:5, bytes:490, used:0.664s, actions:3
For outgoing traffic (in port = 3), the actions are set vxlan tunnel to host 1.1.1.75 (encap) and forward to port 2
For incoming traffic (in port = 2), matching is done also on vxlan tunnel info which was decapsulated, and the action is forward to port 3
Manually Adding TC Rules
Offloading rules can also be added directly, and not just through OVS, using the tc utility.
To enable TC ingress on both the PF and the VF representors:
# tc qdisc add dev enp4s0f0 ingress
# tc qdisc add dev enp4s0f0_0 ingress
# tc qdisc add dev enp4s0f0_1 ingress
Examples
L2 Rule
# tc filter add dev ens4f0_0 protocol ip parent ffff: \
    flower \
    skip_sw \
    dst_mac e4:11:22:11:4a:51 \
    src_mac e4:11:22:11:4a:50 \
    action drop
VLAN Rule
# tc filter add dev ens4f0_0 protocol 802.1Q parent ffff: \
    flower \
    skip_sw \
    dst_mac e4:11:22:11:4a:51 \
    src_mac e4:11:22:11:4a:50 \
    action vlan push id 100 \
    action mirred egress redirect dev ens4f0

# tc filter add dev ens4f0 protocol 802.1Q parent ffff: \
    flower \
    skip_sw \
    dst_mac e4:11:22:11:4a:51 \
    src_mac e4:11:22:11:4a:50 \
    vlan_ethtype 0x800 \
    vlan_id 100 \
    vlan_prio 0 \
    action vlan pop \
    action mirred egress redirect dev ens4f0_0
VXLAN Rule
# tc filter add dev ens4f0_0 protocol 0x806 parent ffff: \
    flower \
    skip_sw \
    dst_mac e4:11:22:11:4a:51 \
    src_mac e4:11:22:11:4a:50 \
    action tunnel_key set \
    src_ip 20.1.12.1 \
    dst_ip 20.1.11.1 \
    id 100 \
    action mirred egress redirect dev vxlan100

# tc filter add dev vxlan100 protocol 0x806 parent ffff: \
    flower \
    skip_sw \
    dst_mac e4:11:22:11:4a:51 \
    src_mac e4:11:22:11:4a:50 \
    enc_src_ip 20.1.11.1 \
    enc_dst_ip 20.1.12.1 \
    enc_key_id 100 \
    enc_dst_port 4789 \
    action tunnel_key unset \
    action mirred egress redirect dev ens4f0_0
Bond Rule
Bond rules can be added in one of the following methods:
Using shared block (requires kernel support):
# tc qdisc add dev bond0 ingress_block 22 ingress
# tc qdisc add dev ens4p0 ingress_block 22 ingress
# tc qdisc add dev ens4p1 ingress_block 22 ingress
Add drop rule:
# tc filter add block 22 protocol arp parent ffff: prio 3 \
    flower \
    dst_mac e4:11:22:11:4a:51 \
    action drop
Add redirect rule from bond to representor:
# tc filter add block 22 protocol arp parent ffff: prio 3 \
    flower \
    dst_mac e4:11:22:11:4a:50 \
    action mirred egress redirect dev ens4f0_0
Add redirect rule from representor to bond:
# tc filter add dev ens4f0_0 protocol arp parent ffff: prio 3 \
    flower \
    dst_mac ec:0d:9a:8a:28:42 \
    action mirred egress redirect dev bond0
Without using shared block:
Add redirect rule from bond to representor:
# tc filter add dev bond0 protocol arp parent ffff: prio 1 \
    flower \
    dst_mac e4:11:22:11:4a:50 \
    action mirred egress redirect dev ens4f0_0
Add redirect rule from representor to bond:
# tc filter add dev ens4f0_0 protocol arp parent ffff: prio 3 \
    flower \
    dst_mac ec:0d:9a:8a:28:42 \
    action mirred egress redirect dev bond0
VLAN Modify
VLAN Modify rules can be added in one of the following methods:
tc filter add dev $REP_DEV1 protocol 802.1q ingress prio 1 flower \
    vlan_id 10 \
    action vlan modify id 11 pipe \
    action mirred egress redirect dev $REP_DEV2

tc filter add dev $REP_DEV1 protocol 802.1q ingress prio 1 flower \
    vlan_id 10 \
    action vlan pop pipe action vlan push id 11 pipe \
    action mirred egress redirect dev $REP_DEV2
SR-IOV VF LAG
SR-IOV VF LAG allows the NIC’s physical functions (PFs) to get the rules that the OVS will try to offload to the bond net-device, and to offload them to the hardware e-switch. Bond modes supported are:
Active-Backup
XOR
LACP
SR-IOV VF LAG enables complete offload of the LAG functionality to the hardware. The bonding creates a single bonded PF port. Packets from up-link can arrive from any of the physical ports, and will be forwarded to the bond device.
When hardware offload is used, packets from both ports can be forwarded to any of the VFs. Traffic from the VF can be forwarded to both ports according to the bonding state. Meaning, when in active-backup mode, only one PF is up, and traffic from any VF will go through this PF. When in XOR or LACP mode, if both PFs are up, traffic from any VF will split between these two PFs.
SR-IOV VF LAG Configuration on ASAP2
To enable SR-IOV VF LAG, both physical functions of the NIC should first be configured to SR-IOV SwitchDev mode, and only afterwards bond the up-link representors.
The example below shows the creation of bond interface on two PFs:
Load the bonding device and enslave the up-link representor (currently the PF) net-devices.
modprobe bonding mode=802.3ad
ifup bond0 (make sure an ifcfg file with the desired bond configuration is present; see the sketch after these steps)
ip link set enp4s0f0 master bond0
ip link set enp4s0f1 master bond0
Add the VF representor net-devices as OVS ports. If tunneling is not used, add the bond device as well.
ovs-vsctl add-port ovs-sriov bond0
ovs-vsctl add-port ovs-sriov enp4s0f0_0
ovs-vsctl add-port ovs-sriov enp4s0f1_0
Make sure to bring up the PF and the representor netdevices.
ip link set dev bond0 up
ip link set dev enp4s0f0_0 up
ip link set dev enp4s0f1_0 up
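A minimal ifcfg-bond0 sketch for the first step above, assuming a Red Hat-style /etc/sysconfig/network-scripts setup (keys and values are illustrative; adjust the bond options and file location to your distribution):
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=802.3ad miimon=100"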
Once SR-IOV VF LAG is configured, all VFs of the two PFs will become part of the bond, and will behave as described above.
Limitations
In VF LAG mode, outgoing traffic in load balanced mode is according to the origin ring, thus, half of the rings will be coupled with port 1 and half with port 2. All the traffic on the same ring will be sent from the same port.
VF LAG configuration is not supported when the NUM_OF_VFS configured in mlxconfig is higher than 64.
Port Mirroring (Flow Based VF Traffic Mirroring for ASAP²)
Port Mirroring is currently supported in ConnectX-5 adapter cards only.
Unlike para-virtual configurations, when the VM traffic is offloaded to the hardware via an SR-IOV VF, the host-side admin cannot snoop the traffic (e.g. for monitoring).
ASAP² uses the existing mirroring support in OVS and TC, along with an enhancement to the offloading logic in the driver, to allow mirroring the VF traffic to another VF.
The mirrored VF can be used to run a traffic analyzer (tcpdump, Wireshark, etc.) and observe the traffic of the VF being mirrored.
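For instance, inside the VM that owns the mirror VF (assuming its netdevice is named eth0; the name is illustrative), the mirrored traffic can be observed with:
tcpdump -nnei eth0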
The example below shows the creation of port mirror on the following configuration:
# ovs-vsctl show
09d8a574-9c39-465c-9f16-47d81c12f88a
    Bridge br-vxlan
        Port "enp4s0f0_1"
            Interface "enp4s0f0_1"
        Port "vxlan0"
            Interface "vxlan0"
                type: vxlan
                options: {key="100", remote_ip="192.168.1.14"}
        Port "enp4s0f0_0"
            Interface "enp4s0f0_0"
        Port "enp4s0f0_2"
            Interface "enp4s0f0_2"
        Port br-vxlan
            Interface br-vxlan
                type: internal
    ovs_version: "2.8.90"
If we want to set enp4s0f0_0 as the mirror port and mirror all of the traffic, set it as follows:
# ovs-vsctl -- --id=@p get port enp4s0f0_0 \
    -- --id=@m create mirror name=m0 select-all=true output-port=@p \
    -- set bridge br-vxlan mirrors=@m
If we want to set enp4s0f0_0 as the mirror port and only mirror the traffic whose destination is enp4s0f0_1, set it as follows:
# ovs-vsctl -- --id=@p1 get port enp4s0f0_0 \
    -- --id=@p2 get port enp4s0f0_1 \
    -- --id=@m create mirror name=m0 select-dst-port=@p2 output-port=@p1 \
    -- set bridge br-vxlan mirrors=@m
If we want to set enp4s0f0_0 as the mirror port and only mirror the traffic whose source is enp4s0f0_1, set it as follows:
# ovs-vsctl -- --id=@p1 get port enp4s0f0_0 \
    -- --id=@p2 get port enp4s0f0_1 \
    -- --id=@m create mirror name=m0 select-src-port=@p2 output-port=@p1 \
    -- set bridge br-vxlan mirrors=@m
If we want to set enp4s0f0_0 as the mirror port and mirror all the traffic on enp4s0f0_1, set it as follows:
# ovs-vsctl -- --id=@p1 get port enp4s0f0_0 \
    -- --id=@p2 get port enp4s0f0_1 \
    -- --id=@m create mirror name=m0 select-dst-port=@p2 select-src-port=@p2 output-port=@p1 \
    -- set bridge br-vxlan mirrors=@m
To clear the mirror configuration:
# ovs-vsctl clear bridge br-vxlan mirrors
Performance Tuning Based on Traffic Patterns
Offloaded flows (including connection tracking) are added to virtual switch FDB flow tables. FDB tables have a set of flow groups. Each flow group saves the same traffic pattern flows. For example, for connection tracking offloaded flow, TCP and UDP are different traffic patterns which end up in two different flow groups.
A flow group has a limited size to save flow entries. By default, the driver has 4 big FDB flow groups. Each of these big flow groups can save at most 4000000/(4+1)=800k different 5-tuple flow entries. For scenarios with more than 4 traffic patterns, the driver provides a module parameter (num_of_groups) to allow customization and performance tune.
The size of each big flow group can be calculated according to the following formula.
size = 4000000/(num_of_groups+1)
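For example, with num_of_groups set to 7, each big flow group can hold up to 4000000/(7+1) = 500,000 flow entries.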
To change the number of big FDB flow groups, run:
$ echo <num_of_groups> > /sys/module/mlx5_core/parameters/num_of_groups
The change takes effect immediately if there is no flow inside the FDB table (no traffic running and all offloaded flows are aged out), and it can be dynamically changed without reloading the driver.
The module parameter can be set statically in the /etc/modprobe.d/mlnx.conf file. This way the administrator will not be required to set it via sysfs each time the driver is reloaded.
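For example, a possible /etc/modprobe.d/mlnx.conf entry (the value 8 is illustrative):
options mlx5_core num_of_groups=8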
If there are residual offloaded flows when changing this parameter, then the new configuration only takes effect after all flows age out.
Note that the default value of num_of_groups may change per MLNX_OFED driver version. The following table lists the values that must be set when upgrading the MLNX_OFED version prior to driver load, in order to achieve the same OOB experience.
MLNX_OFED Version   | num_of_groups Default Value
v4.7-3.2.9.0        | 4
v4.6-3.1.9.0.14     | 15
v4.6-3.1.9.0.15     | 15
v4.5-1.0.1.0.19     | 63
OVS-DPDK Hardware Offloads
Note that OVS-DPDK is supported on ConnectX-5 and BlueField NICs only.
OVS-DPDK Hardware Offloads Configuration
To configure OVS-DPDK HW offloads:
Unbind the VFs.
echo 0000:04:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:04:00.3 > /sys/bus/pci/drivers/mlx5_core/unbind
Note: VMs with attached VFs must be powered off to be able to unbind the VFs.
Change the e-switch mode from Legacy to SwitchDev on the PF device (make sure all VFs are unbound). This will also create the VF representor netdevices in the host OS.
echo switchdev > /sys/class/net/enp4s0f0/compat/devlink/mode
To revert to SR-IOV Legacy mode:
echo legacy > /sys/class/net/enp4s0f0/compat/devlink/mode
Note that running this command will also result in the removal of the VF representor netdevices.
Bind the VFs.
echo 0000:04:00.2 > /sys/bus/pci/drivers/mlx5_core/bind
echo 0000:04:00.3 > /sys/bus/pci/drivers/mlx5_core/bind
Run the Open vSwitch service.
systemctl start openvswitch
Enable hardware offload (disabled by default).
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
Configure the DPDK white list.
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-extra="-w 0000:01:00.0,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=1"
Restart the Open vSwitch service. This step is required for HW offload changes to take effect.
systemctl restart openvswitch
Add PF to OVS.
ovs-vsctl add-port br0-ovs pf -- set Interface pf type=dpdk options:dpdk-devargs=0000:88:00.0
Add representor to OVS.
ovs-vsctl add-port br0-ovs representor -- set Interface representor type=dpdk options:dpdk-devargs=0000:88:00.0,representor=[$rep]
Offloading VXLAN Encapsulation/Decapsulation Actions
Running the vSwitch in user space, rather than the kernel-based Open vSwitch, requires an additional bridge. The purpose of this bridge is to allow use of the kernel network stack for routing and ARP resolution.
The datapath needs to look-up the routing table and ARP table to prepare the tunnel header and transmit data to the output port.
Configuring VXLAN Encap/Decap Offloads
The configuration is done with:
PF on 0000:03:00.0 PCI and MAC 98:03:9b:cc:21:e8
Local IP 56.56.67.1 - br-phy interface will be configured to this IP
Remote IP 56.56.68.1
To configure OVS-DPDK VXLAN:
Create a br-phy bridge.
ovs-vsctl add-br br-phy -- set Bridge br-phy datapath_type=netdev -- br-set-external-id br-phy bridge-id br-phy -- set bridge br-phy fail-mode=standalone other_config:hwaddr=98:03:9b:cc:21:e8
Attach the PF interface to the br-phy bridge.
ovs-vsctl add-port br-phy p0 -- set Interface p0 type=dpdk options:dpdk-devargs=0000:03:00.0
Configure an IP address on the bridge.
ip addr add 56.56.67.1/24 dev br-phy
Create a br-ovs bridge.
ovs-vsctl add-br br-ovs -- set Bridge br-ovs datapath_type=netdev -- br-set-external-id br-ovs bridge-id br-ovs -- set bridge br-ovs fail-mode=standalone
Attach representor to br-ovs.
ovs-vsctl add-port br-ovs pf0vf0 -- set Interface pf0vf0 type=dpdk options:dpdk-devargs=0000:03:00.0,representor=[0]
Add a port for the VXLAN tunnel.
ovs-vsctl add-port br-ovs vxlan0 -- set interface vxlan0 type=vxlan options:local_ip=56.56.67.1 options:remote_ip=56.56.68.1 options:key=45 options:dst_port=4789
Connection Tracking Offload
Connection tracking enables stateful packet processing by keeping a record of currently open connections.
OVS flows using connection tracking can be accelerated using advanced Network Interface Cards (NICs) by offloading established connections.
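As an illustration, a minimal sketch of connection-tracking OpenFlow rules (the bridge name br0 and the table layout are illustrative and not taken from this document); once connections are committed and established, flows like these become candidates for hardware offload:
ovs-ofctl add-flow br0 "table=0,priority=1,ip,ct_state=-trk,actions=ct(table=1)"
ovs-ofctl add-flow br0 "table=1,priority=1,ip,ct_state=+trk+new,actions=ct(commit),NORMAL"
ovs-ofctl add-flow br0 "table=1,priority=1,ip,ct_state=+trk+est,actions=NORMAL"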
VirtIO Acceleration through VF Relay (Software vDPA)
This feature has not been accepted to the OVS-DPDK Upstream yet, making its API subject to change.
In user space, there are two main approaches for communicating with a guest (VM), either through SR-IOV, or through virtIO.
Phy ports (SR-IOV) allow working with a port representor, which is attached to the OVS, while a matching VF is passed through to the guest. HW rules can process packets from the up-link and direct them to the VF without going through SW (OVS). Therefore, using SR-IOV achieves the best performance.
However, SR-IOV architecture requires the guest to use a driver specific to the underlying HW. Specific HW driver has two main drawbacks:
Breaks virtualization in some sense (guest is aware of the HW). It can also limit the type of images supported.
Gives less natural support for live migration.
Using a virtIO port solves both problems. However, it reduces performance and causes the loss of some functionality, such as HW offloads that work directly with virtIO. To solve this conflict, a new netdev type, dpdkvdpa, has been created. The new netdev is similar to the regular DPDK netdev, yet introduces several additional functionalities.
dpdkvdpa translates between phy port to virtIO port. It takes packets from the Rx queue and sends them to the suitable Tx queue, and allows transfer of packets from virtIO guest (VM) to a VF, and vice-versa, benefitting from both SR-IOV and virtIO.
To add software vDPA port:
ovs-vsctl add-port br0 vdpa0 -- set Interface vdpa0 type=dpdkvdpa \
    options:vdpa-socket-path=<sock path> \
    options:vdpa-accelerator-devargs=<vf pci id> \
    options:dpdk-devargs=<pf pci id>,representor=[id] \
    options:vdpa-max-queues=<num queues>
Note: vdpa-max-queues is an optional field. When the user wants to configure 32 vDPA ports, the maximum queues number is limited to 8.
Software vDPA Configuration in OVS-DPDK Mode
Prior to configuring vDPA in OVS-DPDK mode, follow the steps below.
Generate the VF.
echo 0 > /sys/class/net/enp175s0f0/device/sriov_numvfs
echo 4 > /sys/class/net/enp175s0f0/device/sriov_numvfs
Unbind each VF.
echo <pci> > /sys/bus/pci/drivers/mlx5_core/unbind
Switch to SwitchDev mode.
echo switchdev >> /sys/class/net/enp175s0f0/compat/devlink/mode
Bind each VF.
echo <pci> > /sys/bus/pci/drivers/mlx5_core/bind
Initialize OVS with:
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl --no-wait set Open_vSwitch . other_config:hw-offload=true
To configure Software vDPA in OVS-DPDK mode:
Open vSwitch configuration.
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-extra="-w 0000:01:00.0,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=1"
/usr/share/openvswitch/scripts/ovs-ctl restart
Create an OVS-DPDK bridge.
ovs-vsctl add-br br0-ovs -- set bridge br0-ovs datapath_type=netdev
ovs-vsctl add-port br0-ovs pf -- set Interface pf type=dpdk options:dpdk-devargs=0000:01:00.0
Create vDPA port as part of the OVS-DPDK bridge.
ovs-vsctl add-port br0-ovs vdpa0 -- set Interface vdpa0 type=dpdkvdpa options:vdpa-socket-path=/var/run/virtio-forwarder/sock0 options:vdpa-accelerator-devargs=0000:01:00.2 options:dpdk-devargs=0000:01:00.0,representor=[0] options:vdpa-max-queues=8
Software vDPA Configuration in OVS-Kernel Mode
SW vDPA can also be used in configurations where the HW offload is done through TC and not DPDK.
Open vSwitch configuration.
ovs-vsctl set Open_vSwitch . other_config:dpdk-extra="-w 0000:01:00.0,representor=[0],dv_flow_en=0,dv_esw_en=0,dv_xmeta_en=0,isolated_mode=1"
/usr/share/openvswitch/scripts/ovs-ctl restart
Create an OVS-DPDK bridge.
ovs-vsctl add-br br0-ovs -- set bridge br0-ovs datapath_type=netdev
Create vDPA port as part of the OVS-DPDK bridge.
ovs-vsctl add-port br0-ovs vdpa0 -- set Interface vdpa0 type=dpdkvdpa options:vdpa-socket-path=/var/run/virtio-forwarder/sock0 options:vdpa-accelerator-devargs=0000:01:00.2 options:dpdk-devargs=0000:01:00.0,representor=[0] options:vdpa-max-queues=8
Create Kernel bridge.
ovs-vsctl add-br br-kernel
Add representors to Kernel bridge.
ovs-vsctl add-port br-kernel enp1s0f0_0
ovs-vsctl add-port br-kernel enp1s0f0
Hardware vDPA Installation
Hardware vDPA requires QEMU v4.0.0 and DPDK v20.02 as minimal versions.
To install QEMU:
Clone the sources:
git clone https://git.qemu.org/git/qemu.git
cd qemu
git checkout v4.0.0
Build QEMU:
mkdir bin
cd bin
../configure --target-list=x86_64-softmmu --enable-kvm
make -j24
To install DPDK:
Clone the sources:
git clone git://dpdk.org/dpdk
cd dpdk
git checkout v20.02
Install dependencies (if needed):
yum install cmake gcc libnl3-devel libudev-devel make pkgconfig valgrind-devel pandoc libibverbs libmlx5 libmnl-devel -y
Configure DPDK:
export RTE_SDK=$PWD
make config T=x86_64-native-linuxapp-gcc
cd build
sed -i 's/\(CONFIG_RTE_LIBRTE_MLX5_PMD=\)n/\1y/g' .config
sed -i 's/\(CONFIG_RTE_LIBRTE_MLX5_VDPA_PMD=\)n/\1y/g' .config
Build DPDK:
make -j
Build the vDPA application:
cd $RTE_SDK/examples/vdpa/
make -j
Hardware vDPA Configuration
Configure huge pages:
mkdir -p /hugepages
mount -t hugetlbfs hugetlbfs /hugepages
echo <more> > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
echo <more> > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
To configure a vDPA VirtIO interface in an existing VM's xml file (using libvirt):
Open the VM's configuration xml for editing:
virsh edit <domain name>
Modify/add the following:
Change the top line to:
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
Assign a memory amount and use 1GB page size for hugepages (size must be the same as used for the vDPA application), so that the memory configuration looks like the following.
<memory unit='KiB'>4194304</memory>
<currentMemory unit='KiB'>4194304</currentMemory>
<memoryBacking>
  <hugepages>
    <page size='1048576' unit='KiB'/>
  </hugepages>
</memoryBacking>
Assign an amount of CPUs for the VM CPU configuration, so that the vcpu and cputune configuration looks like the following.
<vcpu placement='static'>5</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='14'/>
  <vcpupin vcpu='1' cpuset='16'/>
  <vcpupin vcpu='2' cpuset='18'/>
  <vcpupin vcpu='3' cpuset='20'/>
  <vcpupin vcpu='4' cpuset='22'/>
</cputune>
Set the memory access for the CPUs to be shared, so that the cpu configuration looks like the following.
<cpu mode='custom' match='exact' check='partial'>
  <model fallback='allow'>Skylake-Server-IBRS</model>
  <numa>
    <cell id='0' cpus='0-4' memory='8388608' unit='KiB' memAccess='shared'/>
  </numa>
</cpu>
Set the emulator in use to be the one built in step "2. Build QEMU" above, so that the emulator configuration looks as follows.
<emulator><path to qemu executable></emulator>
Add a virtio interface using qemu command line argument entries, so that the new interface snippet looks as follows.
<qemu:commandline>
  <qemu:arg value='-chardev'/>
  <qemu:arg value='socket,id=charnet1,path=/tmp/sock-virtio0'/>
  <qemu:arg value='-netdev'/>
  <qemu:arg value='vhost-user,chardev=charnet1,queues=16,id=hostnet1'/>
  <qemu:arg value='-device'/>
  <qemu:arg value='virtio-net-pci,mq=on,vectors=6,netdev=hostnet1,id=net1,mac=e4:11:c6:d3:45:f2,bus=pci.0,addr=0x6,page-per-vq=on,rx_queue_size=1024,tx_queue_size=1024'/>
</qemu:commandline>
Note: In this snippet, the vhostuser socket file path, the amount of queues, the MAC and the PCI slot of the VirtIO device can be configured.
Running Hardware vDPA
Hardware vDPA supports SwitchDev mode only.
Create the ASAP2 environment:
Create the VFs.
Enter switchdev mode.
Set up OVS.
Run the vDPA application.
cd $RTE_SDK/examples/vdpa/build
./vdpa -w <VF PCI BDF>,class=vdpa --log-level=pmd,info -- -i
Create a vDPA port via the vDPA application CLI.
create /tmp/sock-virtio0 <PCI DEVICE BDF>
Note: The vhostuser socket file path must be the one used when configuring the VM.
Start the VM.
virsh start <domain name>
For further information on the vDPA application, please visit: https://doc.dpdk.org/guides/sample_app_ug/vdpa.html.
Mellanox Firmware Tools
Download and install the MFT package corresponding to your computer’s operating system. You would need the kernel-devel or kernel-headers RPM before the tools are built and installed.
The package is available at http://www.mellanox.com => Products => Software => Firmware Tools.
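A minimal installation sketch, assuming the downloaded archive is named mft-<version>.tgz and contains the standard install.sh installer (adjust the file name to the version you downloaded):
# tar -xzf mft-<version>.tgz
# cd mft-<version>
# ./install.sh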
Start the mst driver.
# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
Show the devices status.
# mst status
MST modules:
------------
    MST PCI module loaded
    MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE             MST                         PCI       RDMA  NET            NUMA
ConnectX4lx(rev:0)      /dev/mst/mt4117_pciconf0.1  04:00.1         net-enp4s0f1   NA
ConnectX4lx(rev:0)      /dev/mst/mt4117_pciconf0    04:00.0         net-enp4s0f0   NA

# mlxconfig -d /dev/mst/mt4117_pciconf0 q | head -16

Device #1:
----------
Device type:    ConnectX4lx
PCI device:     /dev/mst/mt4117_pciconf0
Configurations:           Current
    SRIOV_EN              True(1)
    NUM_OF_VFS            8
    PF_LOG_BAR_SIZE       5
    VF_LOG_BAR_SIZE       5
    NUM_PF_MSIX           63
    NUM_VF_MSIX           11
    LINK_TYPE_P1          ETH(2)
    LINK_TYPE_P2          ETH(2)
Make sure your configuration is as follows:
* SR-IOV is enabled (SRIOV_EN=1)
* The number of enabled VFs is enough for your environment (NUM_OF_VFS=N)
* The port’s link type is Ethernet (LINK_TYPE_P1/2=2) when applicable
If this is not the case, use mlxconfig to enable that, as follows:
Enable SR-IOV.
# mlxconfig -d /dev/mst/mt4115_pciconf0 s SRIOV_EN=1
Set the number of required VFs.
# mlxconfig -d /dev/mst/mt4115_pciconf0 s NUM_OF_VFS=8
Set the link type to Ethernet.
# mlxconfig -d /dev/mst/mt4115_pciconf0 s LINK_TYPE_P1=2
# mlxconfig -d /dev/mst/mt4115_pciconf0 s LINK_TYPE_P2=2
Reset the firmware.
# mlxfwreset -d /dev/mst/mt4115_pciconf0 reset
Query the firmware to make sure everything is set correctly.
# mlxconfig -d /dev/mst/mt4115_pciconf0 q