Advanced Features
Packets transmitted over an offloaded socket may be rate-limited, allowing granular rate control over software-defined flows. A rate-limited flow is allowed to transmit a few packets (a burst) before its transmission rate is evaluated, and the next packet is scheduled for transmission accordingly.
This is a simple form of Packet Pacing that supports basic functionality. For advanced Packet Pacing support and a wider range of specifications, please refer to the Rivermax library.
Prerequisites
MLNX_OFED version 4.1-x.x.x.x and above
VMA supports packet pacing with NVIDIA® ConnectX®-5 devices.
If you have MLNX_OFED installed, you can verify whether your NIC supports packet pacing by running:
ibv_devinfo -v
Check the supported pace range under the section packet_pacing_caps (this range is in Kbit per second).
packet_pacing_caps:
    qp_rate_limit_min: 1kbps
    qp_rate_limit_max: 100000000kbps
Usage
To apply Packet Pacing to a socket:
Run VMA with VMA_RING_ALLOCATION_LOGIC_TX=10.
Set the SO_MAX_PACING_RATE option for the socket:
uint32_t val = [rate in bytes per second];
setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE, &val, sizeof(val));
Notes:
VMA converts the setsockopt value from bytes per second to Kbit per second.
It is possible that the socket may be used over multiple NICs, some of which support Packet Pacing and some do not. Hence, setting the SO_MAX_PACING_RATE socket option does not guarantee that Packet Pacing will be applied.
If setting the packet pacing rate fails, an error log is printed to the screen and no pacing is applied.
VMA supports hardware timestamping for UDP-RX flow (only) with Precision Time Protocol (PTP).
When using VMA on a server running a PTP daemon, VMA can periodically query the kernel to obtain updated time conversion parameters which it uses in conjunction with the hardware time-stamp it receives from the NIC to provide synchronized time.
Prerequisites
Supported devices: HCA clock available (NVIDIA® ConnectX®-4 and above)
Set VMA_HW_TS_CONVERSION environment variable to 4
Usage
Set the SO_TIMESTAMPING option for the socket with value SOF_TIMESTAMPING_RX_HARDWARE:
uint8_t val = SOF_TIMESTAMPING_RX_HARDWARE;
setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
Set VMA environment parameter VMA_HW_TS_CONVERSION to 4.
Example:
Use the Linux kernel (v4.11) timestamping example found in the kernel source at: tools/testing/selftests/networking/timestamping/timestamping.c.
Server
$ sudo LD_PRELOAD=libvma.so VMA_HW_TS_CONVERSION=4 ./timestamping <iface> SOF_TIMESTAMPING_RAW_HARDWARE SOF_TIMESTAMPING_RX_HARDWARE
Client
$ LD_PRELOAD=libvma.so sockperf tp -i <server-ip> -t 3600 -p 6666 --mps 10
timestamping output:
SOL_SOCKET SO_TIMESTAMPING SW 0.000000000 HW raw 1497823023.070846953 IP_PKTINFO interface index 8
SOL_SOCKET SO_TIMESTAMPING SW 0.000000000 HW raw 1497823023.170847260 IP_PKTINFO interface index 8
SOL_SOCKET SO_TIMESTAMPING SW 0.000000000 HW raw 1497823023.270847093 IP_PKTINFO interface index 8
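For applications that read the timestamp themselves rather than through the kernel sample, the following C sketch shows one way to pull the raw hardware timestamp out of the ancillary data returned by recvmsg(). It follows the standard Linux SO_TIMESTAMPING convention, where the control message carries three timespec values and the third is the raw hardware time. The function name and the local struct are hypothetical, and the sketch assumes a 64-bit Linux build and that VMA delivers the timestamp the same way the kernel does.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

/* Mirrors the layout of struct scm_timestamping from <linux/errqueue.h>:
 * ts[0] is the software timestamp, ts[1] is deprecated,
 * ts[2] is the raw hardware timestamp. */
struct ts_triple {
    struct timespec ts[3];
};

/* Fallback: on Linux, SCM_TIMESTAMPING has the same value as SO_TIMESTAMPING. */
#ifndef SCM_TIMESTAMPING
#define SCM_TIMESTAMPING SO_TIMESTAMPING
#endif

/* Scans the ancillary data of a received message for an SCM_TIMESTAMPING
 * record and copies the raw hardware timestamp into *out.
 * Returns 0 if a record was found, -1 otherwise. */
int extract_hw_timestamp(struct msghdr *msg, struct timespec *out)
{
    struct cmsghdr *cmsg;

    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL; cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_TIMESTAMPING &&
            cmsg->cmsg_len >= CMSG_LEN(sizeof(struct ts_triple))) {
            struct ts_triple t;
            memcpy(&t, CMSG_DATA(cmsg), sizeof(t));
            *out = t.ts[2];
            return 0;
        }
    }
    return -1;
}
```

The caller is expected to supply a control buffer of at least CMSG_SPACE(sizeof(struct ts_triple)) bytes in msg_control before calling recvmsg(), then pass the same msghdr to extract_hw_timestamp().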
On-Device Memory is supported in ConnectX-5 adapter cards and above.
Each PCI transaction between the system’s RAM and the NIC starts at ~300 nsec (and increases depending on buffer size). Application egress latency can be improved by eliminating as many PCI transactions as possible on the send path.
Today, VMA achieves this by copying the WQE into the doorbell, and for small packets (<190 bytes payload) VMA can inline the packet into the WQE, which removes the data-gather PCI transaction as well. For data sizes above 190 bytes, an additional PCI gather cycle by the NIC is required to pull the data buffer for egress.
VMA uses the on-device memory to store the egress packet if it does not fit into the BF inline buffer. The on-device memory is a resource managed by VMA and is transparent to the user. The total size of the on-device memory is limited to 256 KB for a single-port HCA and to 128 KB for a dual-port HCA. Using VMA_RING_DEV_MEM_TX, the user can set the amount of on-device memory buffer allocated for each TX ring.
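The send-path choices described above can be summarized with a small conceptual model in C. This is not VMA's code: the enum, the function name, and the fallback to a host-memory gather when the ring's on-device memory is exhausted are assumptions made for illustration, as is the treatment of a payload of exactly 190 bytes.

```c
#include <stddef.h>

/* Inline limit in bytes, taken from the text above ("<190 bytes payload"). */
#define INLINE_PAYLOAD_LIMIT 190

enum tx_path {
    TX_INLINE,      /* payload is copied into the WQE itself */
    TX_DEVICE_MEM,  /* payload is staged in on-device memory */
    TX_HOST_GATHER  /* NIC fetches the payload from host RAM over PCI */
};

/* Picks a send path for one packet.
 * dev_mem_free is the number of on-device memory bytes still available
 * to the TX ring (hypothetical bookkeeping value). */
enum tx_path choose_tx_path(size_t payload_len, size_t dev_mem_free)
{
    if (payload_len < INLINE_PAYLOAD_LIMIT)
        return TX_INLINE;
    if (payload_len <= dev_mem_free)
        return TX_DEVICE_MEM;
    return TX_HOST_GATHER; /* assumed fallback when device memory is full */
}
```

In this model, a 64-byte payload is inlined, a 1024-byte payload goes through on-device memory when space is available, and the same payload falls back to the extra PCI gather cycle when no device memory is left.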
Prerequisites
Driver: MLNX_OFED version 4.1-1.0.3.0.1 and above
NIC: NVIDIA® ConnectX®-5 and above.
Protocol: Ethernet.
Set VMA_RING_DEV_MEM_TX environment variable to best suit the application's requirements
Verifying On-Device Memory Capability in the Hardware
To verify “On Device Memory” capability in the hardware, run VMA with DEBUG trace level:
VMA_TRACELEVEL=DEBUG LD_PRELOAD=<path to libvma.so> <command line>
Look in the printout for a positive value of on-device-memory bytes.
For example:
Pid: 30089 Tid: 30089 VMA DEBUG: ibch[0xed61d0]:245:print_val() mlx5_0: port(s): 1 vendor: 4121 fw: 16.23.0258 max_qp_wr: 32768 on_device_memory: 131072
To show and monitor On-Device Memory statistics, run the vma_stats tool:
vma_stats -p <pid> -v 3
For example:
======================================================
RING_ETH=[0]
Tx Offload: 858931 / 3402875 [kilobytes/packets]
Rx Offload: 865251 / 3402874 [kilobytes/packets]
Dev Mem Alloc: 16384
Dev Mem Stats: 739074 / 1784935 / 0 [kilobytes/packets/oob]
======================================================
To enable the TCP_QUICKACK threshold, modify the TCP_QUICKACK_THRESHOLD parameter in the lwip/opt.h file and recompile VMA.
While the TCP_QUICKACK option is enabled, TCP acknowledgments are sent immediately rather than being delayed as in normal TCP receive operation. However, sending the acknowledgment delays processing of the incoming packet until after the acknowledgment has been sent, which can affect performance.
The TCP_QUICKACK threshold enables the user to disable quick acknowledgment for payloads that are larger than the threshold. The threshold is effective only when TCP_QUICKACK is enabled, either using setsockopt() or using the VMA_TCP_QUICKACK parameter. The TCP_QUICKACK threshold is disabled by default.
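The interaction between the socket option and the threshold can be sketched in C as below. enable_quickack() uses the standard TCP_QUICKACK socket option. should_quick_ack() is a hypothetical model of the decision described above, not VMA's implementation, and it assumes that a threshold of 0 means the threshold is disabled.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stddef.h>
#include <sys/socket.h>

/* Turns on TCP_QUICKACK for a TCP socket.
 * Returns the setsockopt() result: 0 on success, -1 on error. */
int enable_quickack(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
}

/* Models the threshold rule: returns 1 if an immediate ACK would be sent.
 * threshold == 0 is assumed to mean "threshold disabled". */
int should_quick_ack(int quickack_enabled, size_t payload_len, size_t threshold)
{
    if (!quickack_enabled)
        return 0;               /* threshold only matters when quick ACK is on */
    if (threshold == 0)
        return 1;               /* no threshold: always acknowledge immediately */
    return payload_len <= threshold;
}
```

With a threshold of 1024 bytes, for example, this model acknowledges a 512-byte payload immediately and lets a 2000-byte payload follow the normal delayed-acknowledgment path.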
Network virtual service client (NetVSC) exposes a virtualized view of the physical network adapter on the guest operating system. NetVSC can be configured to connect to a Virtual Function (VF) of a physical network adapter that supports an SR-IOV interface.
VMA is able to offload the traffic of the NetVSC using the SR-IOV interface only if the SR-IOV interface is available during application initialization.
While the SR-IOV interface is detached, VMA is able to redirect/forward ingress/egress packets to/from the NetVSC. This is done using a dedicated TAP device for each NetVSC, in addition to traffic control rules.
VMA can detect plugin and plugout events during runtime and route the traffic according to the event type.
Prerequisites
HCAs: NVIDIA® ConnectX®-5
Operating systems:
Ubuntu 16.04, kernel 4.15.0-1015-azure
Ubuntu 18.04, kernel 4.15.0-1015-azure
RHEL 7.5, kernel 3.10.0-862.9.1.el7
MLNX_OFED/Inbox driver: 4.5-x.x.x.x and above
WinOF: v5.60 and above, WinOF-2: v2.10 and above
Protocol: Ethernet
Root or CAP_NET_ADMIN permissions
VMA daemon enabled
Windows Hypervisor Configuration
For instructions on how to configure Windows Hypervisor, please refer to VMA section in WinOF User Manual at the Mellanox webpage under Products → Software → InfiniBand/VPI Drivers → MLNX_OFED.
VMA Daemon Design
The VMA daemon is responsible for managing the traffic control logic of all VMA processes, including qdisc, u32 table hashing, adding and removing filters, and removing filters when an application crashes.
For VMA daemon usage instructions, refer to the Installing the VMA Binary Package section in the Installation Guide.
For VMA daemon troubleshooting, see the Troubleshooting section.
TAP Statistics
To show and monitor TAP statistics, run the vma_stats tool:
vma_stats -p <pid> -v 3
Example:
======================================================
RING_TAP=[0]
Master: 0x29e4260
Tx Offload: 4463 / 67209 [kilobytes/packets]
Rx Offload: 5977 / 90013 [kilobytes/packets]
Rx Buffers: 256
VF Plugouts: 1
Tap fd: 21
Tap Device: td34f15
======================================================
RING_ETH=[1]
Master: 0x29e4260
Tx Offload: 7527 / 113349 [kilobytes/packets]
Rx Offload: 7527 / 113349 [kilobytes/packets]
Retransmissions: 1
======================================================
Output analysis:
RING_TAP[0] and RING_ETH[1] have the same bond master 0x29e4260
4463 Kbytes/67209 packets were sent from the TAP device
5977 Kbytes/90013 packets were received from the TAP device
Plugout event occurred once
TAP device fd number was 21, TAP name was td34f15
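When the counters need to be collected by a script or monitoring agent, a line such as "Tx Offload: 4463 / 67209 [kilobytes/packets]" can be parsed with a few lines of C. The sketch below uses a hypothetical helper name and assumes each counter is printed on a single line in the "label: kilobytes / packets" form.

```c
#include <stdio.h>

/* Parses a vma_stats counter line of the assumed form
 *   "<label>: <kilobytes> / <packets> [kilobytes/packets]"
 * Returns 0 and fills both outputs on success, -1 if the line does not match. */
int parse_offload_line(const char *line, unsigned long *kilobytes, unsigned long *packets)
{
    /* Skip the label up to ':', then read "<kilobytes> / <packets>". */
    if (sscanf(line, " %*[^:]: %lu / %lu", kilobytes, packets) == 2)
        return 0;
    return -1;
}
```

Lines that carry a single value, such as "Rx Buffers: 256", and the separator lines do not match the pattern and are reported with -1.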