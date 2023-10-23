This section describes a mechanism for adding a SR-IOV VF to a Virtual Machine.

1. Run the virt-manager.
2. Double click on the virtual machine and open its Properties.
3. Go to Details → Add hardware → PCI host device.
4. Choose a Mellanox virtual function according to its PCI device (e.g., 00:03.1).
5. If the Virtual Machine is up, reboot it; otherwise, start it.
6. Log into the virtual machine and verify that it recognizes the Mellanox card. Run:

    lspci | grep Mellanox

   Example:

    lspci | grep Mellanox
    00:03.0 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)

7. [ConnectX-3/ConnectX-3 Pro] Add the device to the /etc/sysconfig/network-scripts/ifcfg-ethX configuration file. The MAC address for every virtual function is configured randomly, therefore it is not necessary to add it.

SR-IOV Virtual Function configuration can be done through the Hypervisor iproute2/netlink tool, if present. Otherwise, it can be done via sysfs.

ip link set { dev DEVICE | group DEVGROUP } [ { up | down } ]
    ...
    [ vf NUM [ mac LLADDR ]
             [ vlan VLANID [ qos VLAN-QOS ] ]
             ...
             [ spoofchk { on | off } ] ]
    ...

sysfs configuration (ConnectX-4):

/sys/class/net/enp8s0f0/device/sriov/[VF]

+-- [VF]
|   +-- config
|   +-- link_state
|   +-- mac
|   +-- mac_list
|   +-- max_tx_rate
|   +-- min_tx_rate
|   +-- spoofcheck
|   +-- stats
|   +-- trunk
|   +-- trust
|   +-- vlan

When running ETH ports on VFs, the ports may be configured to simply pass through packets as is from the VFs (VLAN Guest Tagging, VGT), or the administrator may configure the Hypervisor to silently force packets to be associated with a VLAN/QoS (VLAN Switch Tagging, VST).

In the latter case, untagged or priority-tagged outgoing packets from the guest will have the VLAN tag inserted, and incoming packets will have the VLAN tag removed.

The default behavior is VGT.

To configure VF VST mode, run:

ip link set dev <PF device> vf <NUM> vlan <vlan_id> [qos <qos>]

where:

NUM = 0..max-vf-num

vlan_id = 0..4095

qos = 0..7

For example:

ip link set dev eth2 vf 2 vlan 10 qos 3 - sets VST mode for VF #2 belonging to PF eth2, with vlan_id = 10 and qos = 3

ip link set dev eth2 vf 2 vlan 0 - sets mode for VF 2 back to VGT

Note: In the ConnectX-3 adapter cards family, switching to VGT mode can also be done by setting vlan_id to 4095.
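The parameter ranges listed above can be checked before the command is issued. The following is a minimal sketch, not part of iproute2 or MLNX_OFED: vst_cmd is a hypothetical helper name, and it only prints the ip link command rather than running it.

```shell
#!/bin/sh
# vst_cmd PF VF VLAN_ID [QOS]
# Hypothetical helper: validates the documented ranges (vlan_id 0..4095,
# qos 0..7) and prints the matching "ip link" command without executing it.
vst_cmd() {
    pf=$1
    vf=$2
    vlan=$3
    qos=${4:-}

    case $vlan in
        ''|*[!0-9]*) echo "vlan_id must be a number" >&2; return 1 ;;
    esac
    if [ "$vlan" -gt 4095 ]; then
        echo "vlan_id must be in 0..4095" >&2
        return 1
    fi

    cmd="ip link set dev $pf vf $vf vlan $vlan"
    if [ -n "$qos" ]; then
        case $qos in
            [0-7]) cmd="$cmd qos $qos" ;;
            *) echo "qos must be in 0..7" >&2; return 1 ;;
        esac
    fi
    echo "$cmd"
}

vst_cmd eth2 2 10 3   # VST mode: VLAN 10, priority 3
vst_cmd eth2 2 0      # vlan 0 switches the VF back to VGT
```

Printing the command instead of executing it keeps the sketch safe to try on a machine without SR-IOV hardware; the output can be reviewed and then run as root on the hypervisor.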

Guest MAC configuration - By default, guest MAC addresses are configured to be all zeroes. If the administrator wishes the guest to always start up with the same MAC, the guest MACs should be configured before the guest driver comes up. The guest MAC may be configured by using:

    ip link set dev <PF device> vf <NUM> mac <LLADDR>

For legacy and ConnectX-4 guests, which do not generate random MACs, the administrator should always configure their MAC addresses via ip link, as above.

Spoof checking - Spoof checking is currently available only on upstream kernels newer than 3.1.

    ip link set dev <PF device> vf <NUM> spoofchk [on | off]

Guest Link State

    ip link set dev <PF device> vf <NUM> state [enable | disable | auto]

Virtual function statistics can be queried via sysfs:

    cat /sys/class/infiniband/mlx5_2/device/sriov/2/stats
    tx_packets : 5011
    tx_bytes : 4450870
    tx_dropped : 0
    rx_packets : 5003
    rx_bytes : 4450222
    rx_broadcast : 0
    rx_multicast : 0
    tx_broadcast : 0
    tx_multicast : 8
    rx_dropped : 0
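For scripting, a single counter can be extracted from this output. The sketch below assumes the "name : value" layout shown in the example; vf_stat is a hypothetical helper name, and the sample text stands in for the real sysfs file.

```shell
#!/bin/sh
# vf_stat NAME
# Hypothetical helper: reads "name : value" lines on stdin and prints the
# value of the counter called NAME. Exits non-zero if the counter is absent.
vf_stat() {
    awk -F: -v name="$1" '
        {
            key = $1; val = $2
            gsub(/[ \t]/, "", key)     # strip padding around the name
            gsub(/[ \t]/, "", val)     # strip padding around the value
            if (key == name) { print val; found = 1 }
        }
        END { exit !found }'
}

# On a hypervisor the input would come from the sysfs file, for example:
#   vf_stat tx_bytes < /sys/class/infiniband/mlx5_2/device/sriov/2/stats
sample='tx_packets : 5011
tx_bytes : 4450870
tx_dropped : 0'
printf '%s\n' "$sample" | vf_stat tx_bytes
```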





To view the VFs mapping to ports:

Use the ip link tool v2.6.34~3 and above.

ip link

Output:

    61: p1p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
        link/ether 00:02:c9:f1:72:e0 brd ff:ff:ff:ff:ff:ff
        vf 0 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
        vf 37 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
        vf 38 MAC ff:ff:ff:ff:ff:ff, vlan 65535, spoof checking off, link-state disable
        vf 39 MAC ff:ff:ff:ff:ff:ff, vlan 65535, spoof checking off, link-state disable

When a MAC is ff:ff:ff:ff:ff:ff, the VF is not assigned to the port of the net device it is listed under. In the example above, vf38 is not assigned to the same port as p1p1, in contrast to vf0.

However, the net device can still be used to set and change the settings of VFs that are not assigned to it. For example, the following is a valid command to change the spoof check:

ip link set dev p1p1 vf 38 spoofchk on

This command will affect only vf38. The change can be seen in the ip link output of the net device to which this VF is assigned.
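The ff:ff:ff:ff:ff:ff rule can be applied mechanically to the ip link output. The sketch below assumes the "vf N MAC ..." line layout shown in the example (other iproute2 versions may print VF lines differently); unassigned_vfs is a hypothetical helper name.

```shell
#!/bin/sh
# unassigned_vfs
# Hypothetical helper: reads "ip link" output on stdin and prints the numbers
# of the VFs whose MAC is ff:ff:ff:ff:ff:ff, that is, the VFs that are not
# assigned to the port of the net device they are listed under.
unassigned_vfs() {
    awk '$1 == "vf" && $3 == "MAC" && $4 ~ /^ff:ff:ff:ff:ff:ff/ { print $2 }'
}

# On a hypervisor: ip link show dev p1p1 | unassigned_vfs
sample='61: p1p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
    link/ether 00:02:c9:f1:72:e0 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
    vf 38 MAC ff:ff:ff:ff:ff:ff, vlan 65535, spoof checking off, link-state disable'
printf '%s\n' "$sample" | unassigned_vfs
```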

To map the PCI representation in BDF to the respective ports, run:

mlnx_get_vfs.pl

Output:

    BDF 0000:04:00.0
    Port 1: 2
        vf0 0000:04:00.1
        vf1 0000:04:00.2
    Port 2: 2
        vf2 0000:04:00.3
        vf3 0000:04:00.4
    Both: 1
        vf4 0000:04:00.5





RoCE is supported on Virtual Functions, and VLANs may be used with it. For RoCE, the hypervisor GID table holds 16 entries, while the VFs share the remaining 112 entries. When the number of VFs is larger than 56, some of them will have a GID table with only a single entry, which is inadequate if the VF's Ethernet device is assigned an IP address.

When setting num_vfs in the mlx4_core module parameters, it is important to check that the number of IP addresses assigned per VF does not exceed the limit of the GID table size.
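The arithmetic behind the limit of 56 VFs can be sketched as follows. The helper name gid_entries_per_vf is hypothetical, and the even split of the 112 shared entries is an assumption made for illustration; the driver's actual allocation may differ.

```shell
#!/bin/sh
# gid_entries_per_vf NUM_VFS
# Hypothetical helper: prints "<min> <max>", the smallest and largest GID
# table size a VF gets if the 112 shared entries are split as evenly as
# possible between NUM_VFS VFs (assumed split, for illustration only).
gid_entries_per_vf() {
    shared=112
    n=$1
    if [ "$n" -lt 1 ]; then
        echo "num_vfs must be at least 1" >&2
        return 1
    fi
    min=$(( shared / n ))
    if [ $(( shared % n )) -eq 0 ]; then
        max=$min
    else
        max=$(( min + 1 ))
    fi
    echo "$min $max"
}

gid_entries_per_vf 56   # every VF still gets 2 entries
gid_entries_per_vf 57   # some VFs drop to a single entry
```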

Port Type management is static when enabling SR-IOV (the connectx_port_config script will not work). The port type is set on the Host via a module parameter, port_type_array, in mlx4_core. This parameter may be used to set the port type uniformly for all installed ConnectX® HCAs, or it may specify an individual configuration for each HCA.

This parameter should be specified as an options line in the file /etc/modprobe.d/mlx4_core.conf.

For example, to configure all HCAs to have Port1 as IB and Port2 as ETH, insert the following line:

    options mlx4_core port_type_array=1,2

To set HCAs individually, you may use a string of Domain:bus:device.function=x;y

For example, if you have a pair of HCAs, whose PFs are 0000:04:00.0 and 0000:05:00.0, you may specify that the first will have both ports as IB, and the second will have both ports as ETH as follows:

    options mlx4_core port_type_array='0000:04:00.0-1;1,0000:05:00.0-2;2'

Warning: Only the PFs are set via this mechanism. The VFs inherit their port types from their associated PF.

Each VF presents itself as an independent vHCA to the host, while a single HCA is observable by the network, which is unaware of the vHCAs. No changes are required by the InfiniBand subsystem, ULPs, and applications to support SR-IOV, and vHCAs are interoperable with any existing (non-virtualized) IB deployments.

Sharing the same physical port(s) among multiple vHCAs is achieved as follows:

Each vHCA port presents its own virtual GID table

For further details, please refer to Configuring an Alias GUID (under ports/<n>/admin_guids)

Each vHCA port presents its own virtual PKey table

The virtual PKey table (presented to a VF) is a mapping of selected indexes of the physical PKey table. The host admin can control which PKey indexes are mapped to which virtual indexes using a sysfs interface. The physical PKey table may contain both full and partial memberships of the same PKey to allow different membership types in different virtual tables.

Each vHCA port has its own virtual port state

A vHCA port is up if all of the following conditions apply: the physical port is up; the virtual GID table contains the GIDs requested by the host admin; and the SM has acknowledged the requested GIDs since the last time that the physical port went up.

Other port attributes are shared, such as: GID prefix, LID, SM LID, LMC mask

To allow the host admin to control the virtual GID and PKey tables of vHCAs, a new sysfs 'iov' sub-tree has been added under the PF InfiniBand device.

Warning: If the vHCA comes up without a GUID, make sure you are running the latest version of SM/OpenSM. The SM on QDR switches does not support SR-IOV.





Administration of GUIDs and PKeys is done via the sysfs interface in the Hypervisor (Dom0). This interface is under:

    /sys/class/infiniband/<infiniband device>/iov

Under this directory, the following subdirectories can be found:

ports - The actual (physical) port resource tables

Port GID tables:

ports/<n>/gids/<n> where 0 <= n <= 127 (the physical port gids)

ports/<n>/admin_guids/<n> where 0 <= n <= 127 (allows examining or changing the administrative state of a given GUID)

ports/<n>/pkeys/<n> where 0 <= n <= 126 (displays the contents of the physical pkey table)

<pci id> directories - one for Dom0 and one per guest. Here, you may see the mapping between virtual and physical pkey indices, and the virtual to physical gid 0.

Currently, the GID mapping cannot be modified, but the pkey virtual to physical mapping can.

These directories have the structure: <pci_id>/port/<m>/gid_idx/0 where m = 1..2 (this is read-only)

and <pci_id>/port/<m>/pkey_idx/<n>, where m = 1..2 and n = 0..126



For instructions on configuring pkey_idx, please see below.

1. Determine the GUID index of the PCI Virtual Function that you want to pass through to a guest.

   For example, if you want to pass through PCI function 02:00.3 to a certain guest, you initially need to see which GUID index is used for this function. To do so:

    cat /sys/class/infiniband/mlx4_0/iov/0000:02:00.3/port/<port_num>/gid_idx/0

   The value returned indicates which GUID index to modify on Dom0.

2. Modify the physical GUID table via the admin_guids sysfs interface.

   To configure the GUID at index <guid_index> on port <port_num>:

    echo NEWGUID > /sys/class/infiniband/mlx4_0/iov/ports/<port_num>/admin_guids/<guid_index>

   Example:

    echo "0x002fffff8118" > /sys/class/infiniband/mlx4_0/iov/ports/1/admin_guids/3

   Note: /sys/class/infiniband/mlx4_0/iov/ports/<port_num>/admin_guids/0 is read-only and cannot be changed.

3. Read the administrative status of the GUID index.

   To read the administrative status of GUID index <guid_index> on port number <port_num>:

    cat /sys/class/infiniband/mlx4_0/iov/ports/<port_num>/admin_guids/<guid_index>

4. Check the operational state of a GUID.

    /sys/class/infiniband/mlx4_0/iov/ports/<port_num>/gids (where port_num = 1 or 2)

   The values indicate which GIDs are actually configured on the firmware/hardware, and all the entries are read-only. Compare the value you read under the "admin_guids" directory at that index with the value under the "gids" directory, to verify that the change requested in Step 2 has been accepted by the SM and programmed into the hardware port GID table.

   If the value under admin_guids/<m> is different from the value under gids/<m>, the request is still in progress.

Admin VF GUIDs

As of MLNX_OFED v3.0, the query_gid verb (e.g. ib_query_gid()) returns the value desired by the admin instead of the value that was approved by the SM. This covers the cases where the SM is unreachable, a response is delayed, or the VF is probed into a VM before its GUID is registered with the SM. Without this behavior, if one of these scenarios occurs, the VF sees an incorrect GID (i.e., not the GID that was intended by the admin).

Despite the new behavior, if the SM does not approve the GID, the VF sees its link as down.

On Demand GUIDs

GIDs are requested from the SM on demand, when needed by the VF (e.g. when it becomes active), and are released when the GIDs are no longer in use.

Since a GID is assigned to a VF on the destination HCA while the VF on the source HCA is shut down (but not administratively released), using GIDs on demand eases GID migration.

For compatibility reasons, an explicit admin request to set/change a GUID entry is executed immediately, regardless of whether the VF is active, so that administrators can change the GUID without the need to unbind/bind the VF.

Alias GUIDs Default Mode

Due to the change in the Alias GUID support in InfiniBand behavior, its default mode is now set as HOST assigned instead of SM assigned. To enable out-of-the-box experience, the PF generates random GUIDs as the initial admin values instead of asking the SM.

Initial GUIDs' Values

Initial GUIDs' values depend on the mlx4_ib module parameter 'sm_guid_assign' as follows:

Mode Type | Description
--------- | ---------
admin assigned | Each admin_guid entry has a randomly generated GUID value.
sm assigned | Each admin_guid entry for non-active VFs has a value of 0, meaning that a GUID is requested from the SM upon VF activation. When a VF is active, the value returned from the SM becomes the admin value to be requested again later.

When a VF becomes active, and its admin value is approved, the operational GID entry is changed accordingly. In both modes, the administrator can set/delete the value by using the sysfs Administration Interfaces on the Hypervisor as described above.

Each VF has a single GUID entry in the table based on the VF number. (e.g. VF 1 expects to use GID entry 1). To determine the GUID index of the PCI Virtual Function to pass to a guest, use the sysfs mechanism <gid_idx> directory as described above.

Once an admin request is rejected by the SM, a retry mechanism is started. The retry time is initially set to 1 second, and it is multiplied by 2 on each retry until it reaches the maximum value of 60 seconds. Additionally, when looking for the next record to be updated, the record with the lowest time to be executed is chosen.

Any value reset via the admin_guid interface is immediately executed and it resets the entry's timer.
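The resulting retry schedule can be illustrated with a short calculation. This sketch only reproduces the timing rule described above (start at 1 second, double, cap at 60 seconds); retry_delays is a hypothetical helper name and is not driver code.

```shell
#!/bin/sh
# retry_delays COUNT
# Hypothetical helper: prints the delay in seconds before each of the first
# COUNT retries, doubling from 1 second and capping at 60 seconds.
retry_delays() {
    delay=1
    max=60
    i=0
    out=""
    while [ "$i" -lt "$1" ]; do
        out="$out${out:+ }$delay"
        delay=$(( delay * 2 ))
        if [ "$delay" -gt "$max" ]; then
            delay=$max
        fi
        i=$(( i + 1 ))
    done
    echo "$out"
}

retry_delays 8
```

With eight retries the schedule is 1 2 4 8 16 32 60 60: the seventh delay would be 64 seconds, so it is capped at the 60-second maximum.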

PKeys are used to partition IPoIB communication between the Virtual Machines and the Dom0 by mapping a non-default full-membership PKey to virtual index 0, and mapping the default PKey to a virtual pkey index other than zero.

The following describes how to set up two hosts, each with 2 Virtual Machines. Host1/vm1 will be able to communicate via IPoIB only with Host2/vm1, and Host1/vm2 only with Host2/vm2.

In addition, Host1/Dom0 will be able to communicate only with Host2/Dom0 over ib0. vm1 and vm2 will not be able to communicate with each other, nor with Dom0.

This is done by configuring the virtual-to-physical PKey mappings for all the VMs, such that at virtual PKey index 0, both vm1s will have the same PKey and both vm2s will have the same PKey (different from the vm1s' PKey), and the Dom0s will have the default PKey (different from the VMs' PKeys at index 0).

OpenSM must be used to configure the physical Pkey tables on both hosts.

The physical Pkey table on both hosts (Dom0) will be configured by OpenSM to be:

    index 0 = 0xffff
    index 1 = 0xb000
    index 2 = 0xb030

The vm1's virt-to-physical PKey mapping will be:

    pkey_idx 0 = 1
    pkey_idx 1 = 0

The vm2's virt-to-phys pkey mapping will be:

    pkey_idx 0 = 2
    pkey_idx 1 = 0

So that the default pkey will reside on the vms at index 1 instead of at index 0.

The IPoIB QPs are created to use the PKey at index 0. As a result, the Dom0, vm1 and vm2 IPoIB QPs will all use different PKeys.

To partition IPoIB communication using PKeys:

1. Create a file "/etc/opensm/partitions.conf" on the host on which OpenSM runs, containing the following lines:

    Default=0x7fff,ipoib : ALL=full;
    Pkey1=0x3000,ipoib : ALL=full;
    Pkey3=0x3030,ipoib : ALL=full;

   This will cause OpenSM to configure the physical Port Pkey tables on all physical ports on the network as follows:

    pkey idx | pkey value
    ---------|-----------
    0        | 0xFFFF
    1        | 0xB000
    2        | 0xB030

   The most significant bit indicates if a PKey is a full PKey.

   Warning: The ",ipoib" causes OpenSM to pre-create the IPoIB broadcast group for the indicated PKeys.

2. Configure (on Dom0) the virtual-to-physical PKey mappings for the VMs.

   a. Check the PCI ID for the Physical Function and the Virtual Functions.

    lspci | grep Mel

   b. Assuming that on Host1, the physical function displayed by lspci is "0000:02:00.0", and that on Host2 it is "0000:03:00.0", do the following on Host1:

    cd /sys/class/infiniband/mlx4_0/iov
    0000:02:00.0 0000:02:00.1 0000:02:00.2 ...

   Note: 0000:02:00.0 contains the virtual-to-physical mapping tables for the physical function. 0000:02:00.X contain the virt-to-phys mapping tables for the virtual functions. Do not touch the Dom0 mapping table (the directory ending in :00.0). Modify only tables under 0000:02:00.1 and/or 0000:02:00.2. We assume that vm1 uses VF 0000:02:00.1 and vm2 uses VF 0000:02:00.2.

   c. Configure the virtual-to-physical PKey mapping for the VMs.

    echo 0 > 0000:02:00.1/ports/1/pkey_idx/1
    echo 1 > 0000:02:00.1/ports/1/pkey_idx/0
    echo 0 > 0000:02:00.2/ports/1/pkey_idx/1
    echo 2 > 0000:02:00.2/ports/1/pkey_idx/0

   vm1 pkey index 0 will be mapped to physical pkey index 1, and vm2 pkey index 0 will be mapped to physical pkey index 2. Both vm1 and vm2 will have their pkey index 1 mapped to the default pkey.

   d. On Host2 do the following.

    cd /sys/class/infiniband/mlx4_0/iov
    echo 0 > 0000:03:00.1/ports/1/pkey_idx/1
    echo 1 > 0000:03:00.1/ports/1/pkey_idx/0
    echo 0 > 0000:03:00.2/ports/1/pkey_idx/1
    echo 2 > 0000:03:00.2/ports/1/pkey_idx/0

   e. Once the VMs are running, you can check the VM's virtualized PKey table by running (on the VM):

    cat /sys/class/infiniband/mlx4_0/ports/[1,2]/pkeys/[0,1]

3. Start up the VMs (and bind VFs to them).

4. Configure IP addresses for ib0 on the host and on the guests.
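The relation between the values written in partitions.conf (0x7fff, 0x3000, 0x3030) and the values that appear in the physical PKey table (0xFFFF, 0xB000, 0xB030) is the membership bit described in the procedure above. The following sketch shows the bit arithmetic only; full_pkey and is_full_pkey are hypothetical helper names.

```shell
#!/bin/sh
# full_pkey VALUE
# Hypothetical helper: sets the most significant bit (0x8000) of a 16-bit
# PKey, which marks full membership, and prints the result in hex.
full_pkey() {
    printf '0x%04x\n' $(( $1 | 0x8000 ))
}

# is_full_pkey VALUE
# Hypothetical helper: succeeds if the full-membership bit is set.
is_full_pkey() {
    [ $(( $1 & 0x8000 )) -ne 0 ]
}

for p in 0x7fff 0x3000 0x3030; do
    echo "$p -> $(full_pkey "$p")"
done
```

The loop prints the three table values from the procedure in lowercase hex: 0xffff, 0xb000 and 0xb030.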

Until now, in MLNX_OFED, administrators were unable to run network diagnostics from a VF since sending and receiving Subnet Management Packets (SMPs) from a VF was not allowed, for security reasons: SMPs are not restricted by network partitioning and may affect the physical network topology. Moreover, even the SM may be denied access from portions of the network by setting management keys unknown to the SM.

However, it is desirable to grant SMP capability to certain privileged VFs, so certain network management activities may be conducted within virtual machines rather than only on the hypervisor.

To enable SMP capability for a VF, one must enable the Subnet Management Interface (SMI) for that VF. By default, the SMI interface is disabled for VFs. To enable SMI MADs for VFs, there are two new sysfs entries per VF per port on the Hypervisor (under /sys/class/infiniband/mlx4_X/iov/<b.d.f>/ports/<1 or 2>). These entries are displayed only for VFs (not for the PF), and only for IB ports (not ETH ports).

The first entry, enable_smi_admin, is used to enable SMI on a VF. By default, the value of this entry is zero (disabled). When set to "1", the SMI will be enabled for the VF on the next rebind or openibd restart on the VM that the VF is bound to. If the VF is currently bound, it must be unbound and then re-bound.

The second sysfs entry, smi_enabled, indicates the current enablement state of the SMI. 0 indicates disabled, and 1 indicates enabled. This entry is read-only.

When a VF is initialized (bound), during the initialization sequence, the driver copies the requested SMI state (enable_smi_admin) for that VF/port to the operational SMI state (smi_enabled) for that VF/port, and operates according to the operational state.

Thus, the sequence of operations on the hypervisor is:

1. Enable SMI for any VF/port that you wish.
2. Restart the VM that the VF is bound to (or just run /etc/init.d/openibd restart on that VM).

The SMI will be enabled for the VF/port combinations that you set in step 1 above. You will then be able to run network diagnostics from that VF.

To install MLNX_OFED on a VF which will be enabled to run the tools, run the following on the VM:

mlnx_en_install

FDB, also known as the Forwarding Information Base (FIB) or the forwarding table, is most commonly used in network bridging, routing, and similar functions to find the proper interface to which the input interface should forward a packet.

In the SR-IOV environment, the Ethernet driver can share the existing 128 MACs (for each port) among the Virtual interfaces (VF) and Physical interfaces (PF) that share the same table, as follows:

Each VF gets 2 granted MACs (which are taken from the general pool of the 128 MACs)

Each VF/PF can ask for up to 128 MACs on a first-come, first-served basis (meaning that, except for the 2 granted MACs, the other MACs in the pool are free to be requested)

To check if there are free MACs for an interface (PF or VF), run:

    cat /sys/class/net/<ethX>/fdb_det

Example:

    cat /sys/class/net/eth2/fdb_det
    device eth2: max: 112, used: 2, free macs: 110
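A script can read the number of free MACs from this line before trying to add one. The sketch below assumes the fdb_det line layout shown in the example; free_macs and has_free_mac are hypothetical helper names.

```shell
#!/bin/sh
# free_macs
# Hypothetical helper: reads an fdb_det line on stdin and prints the number
# that follows "free macs:".
free_macs() {
    sed -n 's/.*free macs:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# has_free_mac
# Hypothetical helper: succeeds if the fdb_det line on stdin reports at
# least one free MAC.
has_free_mac() {
    n=$(free_macs)
    [ -n "$n" ] && [ "$n" -gt 0 ]
}

# On a hypervisor the input would come from: cat /sys/class/net/eth2/fdb_det
line='device eth2: max: 112, used: 2, free macs: 110'
echo "$line" | free_macs
if echo "$line" | has_free_mac; then
    echo "a MAC can still be added"
fi
```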

To add a new MAC to the interface:

    echo +<MAC> > /sys/class/net/eth<X>/fdb

When the command above is run, the interface (VF/PF) verifies whether a free MAC exists. If there is a free MAC, the VF/PF takes it from the global pool and allocates it. If there is no free MAC, an error is returned notifying the user of the lack of MACs in the pool.

To delete a MAC from the interface:

    echo -<MAC> > /sys/class/net/eth<X>/fdb

If /sys/class/net/eth<X>/fdb does not exist, use the bridge tool from the iproute2 package, which can manage FDB tables when the kernel supports FDB callbacks:

    bridge fdb add 00:01:02:03:04:05 permanent self dev p3p1
    bridge fdb del 00:01:02:03:04:05 permanent self dev p3p1
    bridge fdb show dev p3p1

Warning: If adding a new MAC from the kernel's NDO function fails due to insufficient MACs in the pool, the following error flow will occur:

If the interface is a PF, it will automatically enter the promiscuous mode

If the interface is a VF, it will try to enter the promiscuous mode and, since it does not support it, the action will fail and an error will be printed in the kernel's log

VGT+ is an advanced mode of Virtual Guest Tagging (VGT), in which a VF is allowed to tag its own packets as in VGT, but is still subject to an administrative VLAN trunk policy. The policy determines which VLAN IDs are allowed to be transmitted or received. The policy does not determine the user priority, which is left unchanged.

A VGT+ port operates in one of two modes: one in which the VF is allowed to send/receive untagged and priority-tagged traffic, and one in which it is not. No default VLAN is defined for a VGT+ port. Sent packets are passed to the eSwitch only if they match the set of allowed VLANs, and received packets are forwarded to the VF only if they match the set.

Warning: In some old OSs, such as SLES11 SP4, any VLAN can be created in the VM, regardless of the VGT+ configuration, but traffic will only pass for the allowed VLANs.

The following are the current VGT+ limitations:

The size of the VLAN set is defined to be up to 10 VLANs including the VLAN 0 that is added for untagged/priority-tagged traffic

This behavior applies to all VF traffic: plain Ethernet, and all RoCE transports

VGT+ allowed VLAN sets may only be extended when the VF is online

An operational VLAN set becomes identical to the administrative VLAN set only after a VF reset

VGT+ is available in DMFS mode only

The default operating mode is VGT:

    cat /sys/class/net/eth5/vf0/vlan_set
    oper:
    admin:

Both states (operational and administrative) are empty.

Warning: If you set the vlan_set parameter with more than 10 VLAN IDs, the driver chooses the first 10 VLAN IDs provided and ignores all the rest.

To enable VGT+ mode:

1. Set the corresponding port/VF (in the example below port eth5 VF0) list of allowed VLANs.

    echo 0 1 2 3 4 5 6 7 8 9 > /sys/class/net/eth5/vf0/vlan_set

   Where 0 specifies if untagged/priority tagged traffic is allowed. This means that if the command below is run, you will not be able to send/receive untagged traffic.

    echo 1 2 3 4 5 6 7 8 9 10 > /sys/class/net/eth5/vf0/vlan_set

2. Reboot the relevant VM for changes to take effect, or run:

    /etc/init.d/openibd restart
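The two rules for vlan_set (only the first 10 IDs are kept, and VLAN 0 controls untagged traffic) can be checked on a candidate list before it is written. This is a sketch of those rules as described in this section, not driver code; effective_vlan_set and untagged_allowed are hypothetical helper names.

```shell
#!/bin/sh
# effective_vlan_set ID...
# Hypothetical helper: prints the first 10 VLAN IDs of the list, which is the
# part the driver is described as keeping.
effective_vlan_set() {
    out=""
    count=0
    for id in "$@"; do
        if [ "$count" -ge 10 ]; then
            break
        fi
        out="$out${out:+ }$id"
        count=$(( count + 1 ))
    done
    echo "$out"
}

# untagged_allowed ID...
# Hypothetical helper: succeeds if VLAN 0 is part of the effective set,
# meaning untagged/priority-tagged traffic would be allowed.
untagged_allowed() {
    for id in $(effective_vlan_set "$@"); do
        if [ "$id" -eq 0 ]; then
            return 0
        fi
    done
    return 1
}

effective_vlan_set 1 2 3 4 5 6 7 8 9 10 11 12
if untagged_allowed 0 1 2 3; then
    echo "untagged traffic allowed"
fi
```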

To disable VGT+ mode:

1. Set the VLAN.

    echo > /sys/class/net/eth5/vf0/vlan_set

2. Reboot the relevant VM for changes to take effect, or run:

    /etc/init.d/openibd restart

To add a VLAN:

In the example below, the following state exists:

    cat /sys/class/net/eth5/vf0/vlan_set
    oper: 0 1 2 3
    admin: 0 1 2 3

1. Make the operational VLAN set identical to the administrative VLAN set.

    echo 2 3 4 5 6 > /sys/class/net/eth5/vf0/vlan_set

   The delta will be added to the operational state immediately (4 5 6):

    cat /sys/class/net/eth5/vf0/vlan_set
    oper: 0 1 2 3 4 5 6
    admin: 2 3 4 5 6

2. Reset the VF for changes to take effect.
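The operational state in this example is the union of the previous operational set and the newly written administrative set, which is why VLANs 0 and 1 remain until the VF is reset. The sketch below reproduces that calculation; oper_after_write is a hypothetical helper name, and the union rule is inferred from the example above.

```shell
#!/bin/sh
# oper_after_write "OPER" "ADMIN"
# Hypothetical helper: prints the operational VLAN set expected right after
# ADMIN is written and before the VF is reset, that is, the sorted union of
# the old operational set and the new administrative set.
oper_after_write() {
    # $1 and $2 are intentionally unquoted so that each VLAN ID becomes
    # a separate line for sort.
    printf '%s\n' $1 $2 | sort -n -u | paste -s -d ' ' -
}

oper_after_write "0 1 2 3" "2 3 4 5 6"
```

After the VF reset, the operational set is expected to match the administrative set (2 3 4 5 6 in this example).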

Warning: When working in SR-IOV, the default operating mode is VGT.

To enable VGT+ mode:

Set the corresponding port/VF (in the example below port eth5, VF0) range of allowed VLANs.

    echo "<add> <start_vid> <end_vid>" > /sys/class/net/eth5/device/sriov/0/trunk

Examples:

Adding VLAN ID range (4-15) to trunk:

    echo add 4 15 > /sys/class/net/eth5/device/sriov/0/trunk

Adding a single VLAN ID to trunk:

    echo add 17 17 > /sys/class/net/eth5/device/sriov/0/trunk

Note: VLAN ID 0 indicates that untagged and priority-tagged traffic is allowed.

To disable VGT+ mode, make sure to remove all VLANs:

    echo rem 0 4095 > /sys/class/net/eth5/device/sriov/0/trunk

To remove selected VLANs:

Removing VLAN ID range (4-15) from trunk:

    echo rem 4 15 > /sys/class/net/eth5/device/sriov/0/trunk

Removing a single VLAN ID from trunk:

    echo rem 17 17 > /sys/class/net/eth5/device/sriov/0/trunk

Virtualized QoS per VF (supported in ConnectX®-3/ConnectX®-3 Pro adapter cards only, with firmware v2.33.5100 and above) limits the maximum throughput of the chosen VFs. The granularity of the rate limitation is 1 Mbit/s.

The feature is disabled by default. To enable it, set the "enable_vfs_qos" module parameter to "1" and add it to the "options mlx4_core" line. When the parameter is set and the feature is supported, this is shown at PF driver load time (in DEV_CAP in the kernel log: GranularQoSRatelimitperVFsupport), provided that the mlx4_core module parameter debug_level is set to 1. For further information, please refer to the debug_level parameter under "mlx4_core Parameters".

When the parameter is set and supported by the firmware, and the driver is running as SR-IOV Master with an Ethernet link, the driver also provides information on the total number of available vPort Priority Pairs (VPPs) and how many VPPs are allocated per priority. All the available VPPs will be allocated on priority 0.

    mlx4_core 0000:1b:00.0: Port 1 Available VPPs 63
    mlx4_core 0000:1b:00.0: Port 1 UP 0 Allocated 63 VPPs
    mlx4_core 0000:1b:00.0: Port 1 UP 1 Allocated 0 VPPs
    mlx4_core 0000:1b:00.0: Port 1 UP 2 Allocated 0 VPPs
    mlx4_core 0000:1b:00.0: Port 1 UP 3 Allocated 0 VPPs
    mlx4_core 0000:1b:00.0: Port 1 UP 4 Allocated 0 VPPs
    mlx4_core 0000:1b:00.0: Port 1 UP 5 Allocated 0 VPPs
    mlx4_core 0000:1b:00.0: Port 1 UP 6 Allocated 0 VPPs
    mlx4_core 0000:1b:00.0: Port 1 UP 7 Allocated 0 VPPs

Warning: The rate limit configuration will take effect only when the VF is in VST mode configured with priority 0.

Rate limit can be configured using the iproute2/netlink tool.

    ip link set dev <PF device> vf <NUM> rate <TXRATE>

where:

NUM = 0...<Num of VF>

<TXRATE> in units of 1Mbit/s

The rate limit for VF can be configured:

While setting it to the VST mode.

    ip link set dev <PF device> vf <NUM> vlan <vlan_id> [qos <qos>] rate <TXRATE>

Before the VF enters the VST mode with a supported priority.

In this case, the rate limit value is saved and the rate limit configuration is applied when VF state is changed to VST mode.

To disable the rate limit configured for a VF, set the VF rate to 0. Once the rate limit is set, you cannot switch to VGT or change the VST priority.

To view current rate limit configurations for VFs, use the iproute2 tool.

    ip link show dev <PF device>

Example:

    89: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000
        link/ether f4:52:14:5e:be:20 brd ff:ff:ff:ff:ff:ff
        vf 0 MAC 00:00:00:00:00:00, vlan 2, tx rate 1500 (Mbps), spoof checking off, link-state auto
        vf 1 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
        vf 2 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
        vf 3 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto

On some OSs, the ip tool may not display the configured rate, or any of the VF information, although both the VST and the rate limit are set through the netlink command. In order to view the configured rate limit, use the sysfs entry provided by the driver. Its location is:

    /sys/class/net/<eth-x>/<vf-i>/tx_rate

Normally, MAC addresses are unique identifiers assigned to network interfaces, and they are fixed addresses that cannot be changed. MAC address spoofing is a technique for altering the MAC address to serve different purposes. Some of the cases in which a MAC address is altered can be legal, while others can be illegal and abuse security mechanisms or disguise a possible attacker.

The SR-IOV MAC address anti-spoofing feature, also known as MAC Spoof Check, provides protection against malicious VM MAC address forging. If the network administrator assigns a MAC address to a VF (through the hypervisor) and enables spoof check on it, the end user can send traffic only from the MAC address assigned to that VF.

Warning: MAC anti-spoofing is disabled by default.

In the configuration example below, the VM is located on VF-0 and has the following MAC address: 11:22:33:44:55:66.

There are two ways to enable or disable MAC anti-spoofing:

1. Use the standard ip link commands - available from Kernel 3.10 and above.

   To enable MAC anti-spoofing, run:

    ip link set ens785f1 vf 0 spoofchk on

   To disable MAC anti-spoofing, run:

    ip link set ens785f1 vf 0 spoofchk off

2. Specify echo "ON" or "OFF" to the file located under /sys/class/net/<ETH_IF_NAME>/device/sriov/<VF index>/spoofchk.

   To enable MAC anti-spoofing, run:

    echo "ON" > /sys/class/net/ens785f1/vf/0/spoofchk

   To disable MAC anti-spoofing, run:

    echo "OFF" > /sys/class/net/ens785f1/vf/0/spoofchk

Warning: This configuration is non-persistent and does not survive driver restart.

Warning: In order for spoof-check enabling/disabling to take effect while the VF is up and running on ConnectX-3 Pro adapter cards, it is required to perform a driver restart on the guest OS.

This feature enables rate limiting traffic per VF in SR-IOV mode for ConnectX-4/ConnectX-4 Lx/ConnectX-5 adapter cards. For details on how to configure rate limit per VF for ConnectX-4/ ConnectX-5, refer to HowTo Configure Rate Limit per VF for ConnectX-4/ConnectX-5 Community post.

The VFs Rate Limit for vSwitch (OVS) feature allows users to join available VFs into groups and set a rate limitation on each group. Rate limitation on a VF group ensures that the total Tx bandwidth that the VFs in this group get (altogether combined) will not exceed the given value.

With this feature, a VF can still be configured with an individual rate limit as in the past (under /sys/class/net/<ifname>/device/sriov/<vf_num>/max_tx_rate). However, the actual bandwidth limit on the VF will eventually be determined considering the VF group limitation and how many VFs are in the same group.

For example: 2 VFs (0 and 1) are attached to group 3.

Case 1: The rate limitation on the group is set to 20G. Rate limit of each VF is 15G

Result: Each VF will have a rate limit of 10G

Case 2: Group’s max rate limitation is still set to 20G. VF 0 is configured to 30G limit, while VF 1 is configured to 5G rate limit

Result: VF 0 will have 15G de-facto. VF 1 will have 5G

The rule of thumb is that the group’s bandwidth is distributed evenly between the number of VFs in the group. If there are leftovers, they will be assigned to VFs whose individual rate limit has not been met yet.
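The rule of thumb above can be sketched as a small calculation. This is an illustration of the even-split-with-leftovers rule only, not the firmware's actual scheduler; group_shares is a hypothetical helper name, all rates are in the same unit (Mbit/s below), and every VF is assumed to have a non-zero individual limit.

```shell
#!/bin/sh
# group_shares GROUP_LIMIT VF_LIMIT...
# Hypothetical helper: prints the effective rate of each VF in a group.
# The group limit is split evenly; a VF whose individual limit is below its
# share is capped at that limit, and the leftover is split among the rest.
group_shares() {
    group=$1
    shift
    printf '%s\n' "$*" | awk -v group="$group" '{
        n = NF; remaining = group; active = n
        for (i = 1; i <= n; i++) { limit[i] = $i; alloc[i] = 0; fin[i] = 0 }
        while (active > 0 && remaining > 0) {
            share = remaining / active
            capped = 0
            # Cap every VF whose individual limit fits inside an even share.
            for (i = 1; i <= n; i++) {
                if (!fin[i] && limit[i] <= share) {
                    alloc[i] = limit[i]; remaining -= limit[i]
                    fin[i] = 1; active--; capped++
                }
            }
            # Nobody was capped: the remaining VFs split what is left evenly.
            if (capped == 0) {
                for (i = 1; i <= n; i++)
                    if (!fin[i]) { alloc[i] = share; fin[i] = 1 }
                active = 0
            }
        }
        for (i = 1; i <= n; i++) printf "%s%d", (i > 1 ? " " : ""), alloc[i]
        printf "\n"
    }'
}

group_shares 20000 15000 15000   # Case 1: 10000 10000
group_shares 20000 30000 5000    # Case 2: 15000 5000
```

The two calls correspond to Case 1 and Case 2 above, with rates expressed in Mbit/s (20000 = 20G).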

When VF rate group is supported by FW, the driver will create a new hierarchy in the SRI-OV sysfs named “groups” (/sys/class/net/<ifname>/device/sriov/groups/). It will contain all the info and the configurations allowed for VF groups. All VFs are placed in group 0 by default since it is the only existing group following the initial driver start. It would be the only group available under /sys/class/net/<ifname>/device/sriov/groups/ The VF can be moved to a different group by writing to the group file -> echo $GROUP_ID > /sys/class/net/<ifname>/device/sriov/<vf_id>/group The group IDs allowed are 0-255 Only when there is at least 1 VF in a group, there will be a group configuration available under /sys/class/net/<ifname>/device/sriov/groups/ (Except for group 0, which is always available even when it’s empty). Once the group is created (by moving at least 1 VF to that group), users can configure the group’s rate limit. For example: echo 10000 > /sys/class/net/<ifname>/device/sriov/5/max_tx_rate – setting individual rate limitation of VF 5 to 10G (Optional) echo 7 > /sys/class/net/<ifname>/device/sriov/5/group – moving VF 5 to group 7 echo 5000 > /sys/class/net/<ifname>/device/sriov/groups/7/max_tx_rate – setting group 7 with rate limitation of 5G When running traffic via VF 5 now, it will be limited to 5G because of the group rate limit even though the VF itself is limited to 10G echo 3 > /sys/class/net/<ifname>/device/sriov/5/group – moving VF 5 to group 3 Group 7 will now disappear from /sys/class/net/<ifname>/device/sriov/groups since there are 0 VFs in it. Group 3 will now appear. Since there’s no rate limit on group 3, VF 5 can transmit at 10G (thanks to its individual configuration)

You can see which group a VF belongs to in the VF’s ‘stats’ sysfs (cat /sys/class/net/<ifname>/device/sriov/<vf_num>/stats)

You can see a group’s current rate limit and the number of VFs attached to it in the group’s ‘config’ sysfs (cat /sys/class/net/<ifname>/device/sriov/groups/<group_id>/config)

Bandwidth guarantee (minimum BW) can be set on a group of VFs to ensure this group is able to transmit at least the amount of bandwidth specified on the wire.

Note the following:

The minimum BW settings on VF groups determine how the groups share the total BW between themselves. They do not impact an individual VF’s rate settings.

The total minimum BW that is set on the VF groups should not exceed the total line rate. Otherwise, the results may be unexpected.

It is still possible to set minimum BW on the individual VFs inside the group. This determines how the VFs share the group’s minimum BW between themselves. The total minimum BW of the VF members should not exceed the minimum BW of the group.

For instructions on how to create groups of VFs, see Limit Bandwidth per Group of VFs above.
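The line-rate constraint in the notes above can be checked before any value is written. The sketch below is illustrative only: the function name min_bw_fits is hypothetical, and all rates are assumed to be integers in Mb/s.

```shell
# Hypothetical pre-check: succeed only if the sum of the minimum rates
# fits within the given total. All values are in Mb/s.
# Usage: min_bw_fits TOTAL_RATE MIN_RATE...
min_bw_fits() {
    total_rate=$1; shift
    total=0
    for rate in "$@"; do
        total=$((total + rate))
    done
    [ "$total" -le "$total_rate" ]
}
```

For a 40Gb link, min_bw_fits 40000 20000 5000 15000 succeeds because the group minimums add up to exactly the line rate. The same check can be applied inside a group by passing the group's minimum BW as the total and the member VFs' minimums as the remaining arguments.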

Example

With a 40Gb link speed, assuming 4 groups and default group 0 have been created:

echo 20000 > /sys/class/net/<ifname>/device/sriov/groups/1/min_tx_rate
echo 5000 > /sys/class/net/<ifname>/device/sriov/groups/2/min_tx_rate
echo 15000 > /sys/class/net/<ifname>/device/sriov/groups/3/min_tx_rate

Group 0 (default): 0 - No BW guarantee is configured.
Group 1: 20000 - This is the maximum min rate among groups.
Group 2: 5000 - 25% of the maximum min rate.
Group 3: 15000 - 75% of the maximum min rate.
Group 4: 0 - No BW guarantee is configured.

Assuming there are VFs attempting to transmit at full line rate in all groups, the minimum BW allocation would be:

Group 0: Will have no BW to use, since no BW guarantee was set on it while other groups do have such settings.
Group 1: Will transmit at 20Gb/s.
Group 2: Will transmit at 5Gb/s.
Group 3: Will transmit at 15Gb/s.
Group 4: Will have no BW to use, since no BW guarantee was set on it while other groups do have such settings.

If a malicious driver is running over one of the VFs and that VF's permissions are not restricted, this may open security holes. However, VFs can be marked as trusted and can thus receive an exclusive subset of physical function privileges or permissions. For example, if all VFs, rather than specific VFs, were allowed to enter promiscuous mode as a privilege, malicious users could sniff and monitor all incoming traffic on the physical port, including traffic targeting other VFs. This is considered a severe security hole.

In the configuration example below, the VM is located on VF-0 and has the following MAC address: 11:22:33:44:55:66.

There are two ways to enable or disable trust:

Use the standard ip link commands (available in kernel 4.5 and above).

To enable trust for a specific VF, run:

ip link set ens785f1 vf 0 trust on

To disable trust for a specific VF, run:

ip link set ens785f1 vf 0 trust off

Write "ON" or "OFF" to the file located under /sys/class/net/<ETH_IF_NAME>/device/sriov/<VF index>/trust.

To enable trust for a specific VF, run:

echo "ON" > /sys/class/net/ens785f1/device/sriov/0/trust

To disable trust for a specific VF, run:

echo "OFF" > /sys/class/net/ens785f1/device/sriov/0/trust
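The sysfs method can be wrapped in a helper that accepts only the two valid states. This is a minimal sketch: the function name set_vf_trust and the SRIOV_DIR variable are hypothetical, and SRIOV_DIR is assumed to point at the PF's sriov directory (for example /sys/class/net/ens785f1/device/sriov).

```shell
# Hypothetical helper: set the trust state of a VF through sysfs.
# SRIOV_DIR must point at the PF's sriov directory,
# e.g. /sys/class/net/ens785f1/device/sriov
# Usage: set_vf_trust VF_INDEX on|off
set_vf_trust() {
    vf=$1
    # Normalize the requested state to the "ON"/"OFF" strings.
    case $2 in
        on|ON)   state=ON ;;
        off|OFF) state=OFF ;;
        *) echo "usage: set_vf_trust VF_INDEX on|off" >&2; return 1 ;;
    esac
    echo "$state" > "$SRIOV_DIR/$vf/trust"
}
```

With SRIOV_DIR set, set_vf_trust 0 on performs the same write as the echo "ON" example for VF 0.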

Probing Virtual Functions (VFs) after SR-IOV is enabled might consume the adapter cards' resources. Therefore, it is recommended not to enable probing of VFs when no monitoring of the VM is needed.

VF probing can be disabled in two ways, depending on the kernel version installed on your server:

If the kernel version installed is v4.12 or above, it is recommended to use the PCI sysfs interface sriov_drivers_autoprobe. For more information, see linux-next branch.

If the kernel version installed is older than v4.12, it is recommended to use the mlx5_core module parameter probe_vf with MLNX_OFED v4.1 or above.

Example:

echo 0 > /sys/module/mlx5_core/parameters/probe_vf
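Choosing between the two methods comes down to a kernel version comparison, which can be sketched as follows. The function name supports_autoprobe is hypothetical, and the parsing assumes a release string of the form reported by uname -r (major.minor followed by anything).

```shell
# Hypothetical check: succeed if the kernel release is v4.12 or above.
# Usage: supports_autoprobe [RELEASE]   (defaults to the running kernel)
supports_autoprobe() {
    release=${1:-$(uname -r)}
    major=${release%%.*}          # text before the first dot
    rest=${release#*.}            # text after the first dot
    minor=${rest%%[!0-9]*}        # leading digits of the remainder
    [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 12 ]; }
}
```

When the check succeeds, writing 0 to /sys/class/net/<ifname>/device/sriov_drivers_autoprobe on the PF would be the sysfs method; otherwise the probe_vf parameter shown above applies. Distribution kernels sometimes backport features, so testing whether the sriov_drivers_autoprobe file exists may be a more direct check than comparing versions.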

For more information on how to probe VFs, see HowTo Configure and Probe VFs on mlx5 Drivers Community post.

VFs can enter a promiscuous mode that enables receiving the unmatched traffic and all the multicast traffic that reaches the physical port, in addition to the traffic originally targeted to the VF. Unmatched traffic is any traffic whose DMAC does not match any of the VFs' or PFs' MAC addresses.

Note: Only privileged/trusted VFs can enter the VF promiscuous mode.

To set the promiscuous mode on for a VF, run:

ifconfig eth2 promisc

To exit the promiscuous mode, run:

ifconfig eth2 -promisc





VFs can enter an all-multi mode that enables receiving all the multicast traffic sent from/to the other functions on the same physical port in addition to the traffic originally targeted to the VF.

Note: Only privileged/trusted VFs can enter the all-multi RX mode.

To set the all-multi mode on for a VF, run:

ifconfig eth2 allmulti

To exit the all-multi mode, run: