Packets transmitted over an offloaded socket may be rate-limited, allowing granular rate control over software-defined flows. A rate-limited flow is allowed to transmit a few packets (a burst) before its transmission rate is evaluated, and the next packet is scheduled for transmission accordingly.
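The "burst, then pace" behavior can be illustrated with a generic pacer. The sketch below uses the virtual-scheduling form of the generic cell rate algorithm (GCRA), which is equivalent to a token bucket. It is an illustration only, not how VMA or the NIC implements pacing, and the names `pacer`, `pacer_init`, and `pacer_schedule` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical GCRA pacer (token-bucket equivalent). Illustrative only;
 * not VMA's or the NIC's implementation. */
typedef struct {
    uint64_t rate_Bps;  /* pacing rate, bytes per second */
    uint64_t tau_ns;    /* burst tolerance: time to send burst_bytes at rate_Bps */
    uint64_t tat_ns;    /* theoretical arrival time of the next packet */
} pacer;

static void pacer_init(pacer *p, uint64_t rate_Bps, uint64_t burst_bytes)
{
    assert(rate_Bps > 0);
    p->rate_Bps = rate_Bps;
    p->tau_ns = burst_bytes * 1000000000ULL / rate_Bps;
    p->tat_ns = 0;
}

/* Returns the earliest time (ns) at which a packet of len bytes may be sent.
 * Packets within the burst allowance go out at now_ns; after that, packets
 * are spaced by len / rate. */
static uint64_t pacer_schedule(pacer *p, uint64_t len, uint64_t now_ns)
{
    uint64_t cost_ns = len * 1000000000ULL / p->rate_Bps;
    /* After an idle period, restart from now: idle time is not banked beyond the burst. */
    uint64_t tat = p->tat_ns > now_ns ? p->tat_ns : now_ns;
    uint64_t send_at = now_ns;

    if (tat + cost_ns > now_ns + p->tau_ns)  /* burst allowance exhausted */
        send_at = tat + cost_ns - p->tau_ns;

    p->tat_ns = tat + cost_ns;
    return send_at;
}
```

For example, with a rate of 1,000,000 bytes/s and a 3000-byte burst, five 1000-byte packets requested at time 0 are scheduled at 0, 0, 0, 1 ms, and 2 ms.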
This is a simple form of Packet Pacing that supports basic functionality. For advanced Packet Pacing support and a wider range of specifications, refer to the Rivermax library.
- MLNX_OFED version 4.1-x.x.x.x and above
VMA supports packet pacing with NVIDIA® ConnectX®-5 devices.
If you have MLNX_OFED installed, you can verify whether your NIC supports packet pacing by running:
Check the supported pacing range under the packet_pacing_caps section (the range is in Kbit per second).
To apply Packet Pacing to a socket:
- Run VMA with VMA_RING_ALLOCATION_LOGIC_TX=10.
Set the SO_MAX_PACING_RATE option for the socket:
- VMA converts the setsockopt value from bytes per second to Kbit per second.
- The socket may be used over multiple NICs, some of which support Packet Pacing and some of which do not. Hence, setting the SO_MAX_PACING_RATE socket option does not guarantee that Packet Pacing will be applied.
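Because the socket option is given in bytes per second while the packet_pacing_caps range is reported in Kbit per second, a requested rate must be converted before it can be compared with the supported range. The sketch below assumes 1 Kbit = 1000 bits and truncating integer division; VMA's exact rounding is not stated here, and the helper names are hypothetical. A rate inside the range is still not a guarantee, for the multi-NIC reason given above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* bytes/s -> Kbit/s, assuming 1 Kbit = 1000 bits (i.e. divide by 125).
 * Integer division truncates. */
static uint64_t bytes_per_sec_to_kbps(uint64_t bytes_per_sec)
{
    return bytes_per_sec * 8 / 1000;
}

/* True if the converted rate lies within [min_kbps, max_kbps], for example
 * the bounds listed under packet_pacing_caps. */
static bool pacing_rate_in_range(uint64_t bytes_per_sec,
                                 uint64_t min_kbps, uint64_t max_kbps)
{
    assert(min_kbps <= max_kbps);
    uint64_t kbps = bytes_per_sec_to_kbps(bytes_per_sec);
    return kbps >= min_kbps && kbps <= max_kbps;
}
```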
If setting packet pacing fails, an error log is printed to the screen and no pacing is applied.
Precision Time Protocol (PTP)
VMA supports hardware timestamping with Precision Time Protocol (PTP) for UDP RX flows only.
When VMA runs on a server with a PTP daemon, VMA can periodically query the kernel for updated time conversion parameters, which it combines with the hardware timestamp received from the NIC to provide synchronized time.
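VMA's actual conversion is not described in this section. The sketch below assumes a linear mapping from hardware clock ticks to nanoseconds, expressed with a fixed-point multiplier and shift in the style of the Linux kernel's cyclecounter/timecounter; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical conversion parameters, refreshed periodically from the kernel. */
typedef struct {
    uint64_t base_ticks;  /* hardware clock reading at the last refresh */
    uint64_t base_ns;     /* synchronized time (ns) corresponding to base_ticks */
    uint32_t mult;        /* nanoseconds per tick, scaled by 2^shift */
    uint32_t shift;
} ts_conv;

/* Convert a raw hardware timestamp (ticks) to synchronized nanoseconds. */
static uint64_t hw_ticks_to_ns(const ts_conv *c, uint64_t ticks)
{
    assert(ticks >= c->base_ticks);
    uint64_t delta = ticks - c->base_ticks;
    /* delta * mult must fit in 64 bits; frequent refreshes keep delta small. */
    return c->base_ns + ((delta * c->mult) >> c->shift);
}
```

Refreshing the parameters periodically serves two purposes in this model: it keeps `delta` small enough to avoid overflow, and it picks up the frequency and offset corrections applied by the PTP daemon.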
- Supported devices: HCA clock available (NVIDIA® ConnectX®-4 and above)
- Set the VMA_HW_TS_CONVERSION environment variable to 4
Set the SO_TIMESTAMPING option for the socket with value SOF_TIMESTAMPING_RX_HARDWARE:
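A minimal sketch of this call, assuming Linux headers (`<linux/net_tstamp.h>` provides the SOF_TIMESTAMPING_* flags). The helper name is hypothetical. Note that the Linux kernel timestamping documentation also describes SOF_TIMESTAMPING_RAW_HARDWARE as the flag that requests reporting of raw hardware timestamps; whether VMA requires it is not stated here, so the sketch uses only the flag named above.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/socket.h>
#include <time.h>
#include <linux/net_tstamp.h>

/* Hypothetical helper: enable RX hardware timestamping on a UDP socket
 * using SOF_TIMESTAMPING_RX_HARDWARE. Returns 0 on success, -1 on failure. */
static int enable_rx_hw_timestamps(int fd)
{
    assert(fd >= 0);  /* caller must pass an open socket descriptor */
    int flags = SOF_TIMESTAMPING_RX_HARDWARE;
    if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags)) != 0) {
        perror("setsockopt(SO_TIMESTAMPING)");
        return -1;
    }
    return 0;
}
```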
- Set VMA environment parameter VMA_HW_TS_CONVERSION to 4.
Use the Linux kernel (v4.11) timestamping example found in the kernel source at: tools/testing/selftests/networking/timestamping/timestamping.c.
On-Device Memory is supported in ConnectX-5 adapter cards and above.
Each PCI transaction between the system’s RAM and the NIC takes at least ~300 nsec, and longer as the buffer size grows. Application egress latency can be improved by eliminating as many PCI transactions as possible on the send path.
VMA currently does this by copying the WQE directly into the doorbell. For small packets (payload under 190 bytes), VMA can also inline the packet into the WQE, which eliminates the PCI transaction for the data gather as well. For payloads above 190 bytes, the NIC requires an additional PCI gather cycle to pull the data buffer for egress.
VMA uses the on-device memory to store an egress packet that does not fit into the BlueFlame (BF) inline buffer. The on-device memory is a resource managed by VMA and is transparent to the user. Its total size is limited to 256 KB for a single-port HCA and to 128 KB for a dual-port HCA. The VMA_RING_DEV_MEM_TX parameter sets the amount of on-device memory allocated for each TX ring.
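The three egress paths described above can be modeled as a simple decision. This is a hypothetical sketch, not VMA's code: the 190-byte inline limit comes from the text, `dev_mem_free` is assumed to be the remaining space in the ring's VMA_RING_DEV_MEM_TX allocation, and falling back to a host gather when that space is exhausted is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the egress paths described above. */
enum tx_path {
    TX_INLINE_WQE,     /* payload copied into the WQE: no data-gather PCI transaction */
    TX_DEVICE_MEMORY,  /* payload staged in on-device memory */
    TX_HOST_GATHER     /* NIC pulls the payload from host RAM over PCI */
};

static enum tx_path choose_tx_path(size_t payload_len, size_t inline_limit,
                                   size_t dev_mem_free)
{
    assert(inline_limit > 0);
    if (payload_len < inline_limit)   /* e.g. inline_limit = 190 bytes */
        return TX_INLINE_WQE;
    if (payload_len <= dev_mem_free)  /* fits in the ring's on-device memory */
        return TX_DEVICE_MEMORY;
    return TX_HOST_GATHER;
}
```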
- Driver: MLNX_OFED version 4.1-220.127.116.11.1 and above
- NIC: NVIDIA® ConnectX®-5 and above.
- Protocol: Ethernet.
- Set VMA_RING_DEV_MEM_TX environment variable to best suit the application's requirements
Verifying On-Device Memory Capability in the Hardware
To verify the “On-Device Memory” capability in the hardware, run VMA with the DEBUG trace level:
Look in the printout for a positive value of on-device-memory bytes.
To show and monitor On-Device Memory statistics, run the vma_stats tool.
To enable the TCP_QUICKACK threshold, modify the TCP_QUICKACK_THRESHOLD parameter in the lwip/opt.h file and recompile VMA.
While the TCP_QUICKACK option is enabled, TCP acknowledgments are sent immediately rather than being delayed as in normal TCP receive operation. However, sending the acknowledgment defers processing of the incoming packet until the acknowledgment has been sent, which can affect performance.
The TCP_QUICKACK threshold disables quick acknowledgment for payloads larger than the threshold. The threshold takes effect only when TCP_QUICKACK is enabled, either with setsockopt() or with the VMA_TCP_QUICKACK parameter. The TCP_QUICKACK threshold is disabled by default.
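A minimal sketch, assuming Linux headers: `enable_quickack()` sets the standard TCP_QUICKACK socket option, and `ack_immediately()` models the threshold rule described above. Both names are hypothetical, and treating a threshold of 0 as "disabled" is an assumption; the actual encoding in lwip/opt.h is not given here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enable TCP_QUICKACK on a TCP socket. */
static int enable_quickack(int fd)
{
    assert(fd >= 0);  /* caller must pass an open socket descriptor */
    int one = 1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one)) != 0) {
        perror("setsockopt(TCP_QUICKACK)");
        return -1;
    }
    return 0;
}

/* Hypothetical model of the threshold rule. threshold == 0 is assumed to
 * mean "threshold disabled". Returns false when normal delayed-ACK
 * behavior applies. */
static bool ack_immediately(bool quickack_enabled, size_t payload_len,
                            size_t threshold)
{
    if (!quickack_enabled)
        return false;                 /* threshold has no effect without TCP_QUICKACK */
    if (threshold == 0)
        return true;                  /* threshold disabled: always quick-ACK */
    return payload_len <= threshold;  /* larger payloads are not quick-ACKed */
}
```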
Linux Guest over Windows Hypervisor
The network virtual service client (NetVSC) exposes a virtualized view of the physical network adapter on the guest operating system. NetVSC can be configured to connect to a Virtual Function (VF) of a physical network adapter that supports an SR-IOV interface.
VMA can offload NetVSC traffic using the SR-IOV interface only if the SR-IOV interface is available when the application initializes.
While the SR-IOV interface is detached, VMA redirects ingress and egress packets to and from the NetVSC using a dedicated TAP device for each NetVSC, together with traffic control rules.
VMA detects VF plugin and plugout events at runtime and routes traffic according to the event type.
- HCAs: NVIDIA® ConnectX®-5
- Operating systems:
  - Ubuntu 16.04, kernel 4.15.0-1015-azure
  - Ubuntu 18.04, kernel 4.15.0-1015-azure
  - RHEL 7.5, kernel 3.10.0-862.9.1.el7
- MLNX_OFED/Inbox driver: 4.5-x.x.x.x and above
- WinOF: v5.60 and above, WinOF-2: v2.10 and above
- Protocol: Ethernet
- Root or CAP_NET_ADMIN permissions
- VMA daemon enabled
VMA Daemon Design
The VMA daemon manages the traffic control logic of all VMA processes, including qdiscs, u32 table hashing, adding and removing filters, and removing filters when an application crashes.
For VMA daemon usage instructions, refer to the Installing the VMA Binary Package section in the Installation Guide.
For VMA daemon troubleshooting, see the Troubleshooting section.
To show and monitor TAP statistics, run the vma_stats tool:
- RING_TAP and RING_ETH have the same bond master 0x29e4260
- 4463 Kbytes/67209 packets were sent from the TAP device
- 5977 Kbytes/90013 packets were received from the TAP device
- Plugout event occurred once
- TAP device fd number was 21, TAP name was td34f15