Advanced Features

NVIDIA Messaging Accelerator (VMA) Documentation Rev 9.8.31

Packet Pacing

Packets transmitted over an offloaded socket may be rate-limited, allowing granular rate control over software-defined flows. A rate-limited flow is allowed to transmit a few packets (a burst) before its transmission rate is evaluated and the next packet is scheduled for transmission accordingly.

Warning

This is a simple form of Packet Pacing supporting basic functionality. For advanced Packet Pacing support and a wider range of capabilities, please refer to the Rivermax library.

Prerequisites

  • MLNX_OFED version 4.1-x.x.x.x and above

  • VMA supports packet pacing with NVIDIA® ConnectX®-5 devices.
    If you have MLNX_OFED installed, you can verify whether your NIC supports packet pacing by running:

    ibv_devinfo -v

    Check the supported pace range under the section packet_pacing_caps (this range is in Kbit per second).

    packet_pacing_caps:
        qp_rate_limit_min: 1kbps
        qp_rate_limit_max: 100000000kbps

Usage

To apply Packet Pacing to a socket:

  1. Run VMA with VMA_RING_ALLOCATION_LOGIC_TX=10.

  2. Set the SO_MAX_PACING_RATE option for the socket:

    uint32_t val = [rate in bytes per second];
    setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE, &val, sizeof(val));

Notes:

  • VMA converts the setsockopt value from bytes per second to Kbit per second.

  • It is possible that the socket may be used over multiple NICs, some of which support Packet Pacing and some do not. Hence, setting the SO_MAX_PACING_RATE socket option does not guarantee that Packet Pacing will be applied.
    If setting the packet pacing fails, an error log is printed to the screen and no pacing is applied.

Precision Time Protocol (PTP)

VMA supports hardware timestamping for the UDP-RX flow (only) with the Precision Time Protocol (PTP).

When using VMA on a server running a PTP daemon, VMA can periodically query the kernel to obtain updated time conversion parameters, which it uses in conjunction with the hardware timestamp it receives from the NIC to provide synchronized time.

Prerequisites

  • Supported devices: HCAs with an available HCA clock (NVIDIA® ConnectX®-4 and above)

  • Set VMA_HW_TS_CONVERSION environment variable to 4

Usage

  1. Set the SO_TIMESTAMPING option for the socket with value SOF_TIMESTAMPING_RX_HARDWARE:

    int val = SOF_TIMESTAMPING_RX_HARDWARE;
    setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));

  2. Set VMA environment parameter VMA_HW_TS_CONVERSION to 4.

Example:

Use the Linux kernel (v4.11) timestamping example found in the kernel source at: tools/testing/selftests/networking/timestamping/timestamping.c.

Server:
$ sudo LD_PRELOAD=libvma.so VMA_HW_TS_CONVERSION=4 ./timestamping <iface> SOF_TIMESTAMPING_RAW_HARDWARE SOF_TIMESTAMPING_RX_HARDWARE

Client:
$ LD_PRELOAD=libvma.so sockperf tp -i <server-ip> -t 3600 -p 6666 --mps 10

timestamping output:
SOL_SOCKET SO_TIMESTAMPING SW 0.000000000 HW raw 1497823023.070846953 IP_PKTINFO interface index 8
SOL_SOCKET SO_TIMESTAMPING SW 0.000000000 HW raw 1497823023.170847260 IP_PKTINFO interface index 8
SOL_SOCKET SO_TIMESTAMPING SW 0.000000000 HW raw 1497823023.270847093 IP_PKTINFO interface index 8


On-Device Memory

Warning

On-Device Memory is supported in ConnectX-5 adapter cards and above.

Each PCI transaction between the system's RAM and the NIC takes ~300 ns or more (increasing with buffer size). Application egress latency can therefore be improved by eliminating as many PCI transactions as possible on the send path.

Today, VMA achieves this by copying the WQE into the doorbell, and for small packets (<190 bytes of payload) VMA can inline the packet into the WQE, eliminating the data-gather PCI transaction as well. For payloads above 190 bytes, an additional PCI gather cycle by the NIC is required to pull the data buffer for egress.

VMA uses the on-device memory to store an egress packet when it does not fit into the BlueFlame (BF) inline buffer. The on-device memory is a resource managed by VMA and is transparent to the user. The total size of the on-device memory is limited to 256k for a single-port HCA and to 128k for a dual-port HCA. Using VMA_RING_DEV_MEM_TX, the user can set the amount of on-device memory buffer allocated for each TX ring.

Prerequisites

  • Driver: MLNX_OFED version 4.1-1.0.3.0.1 and above

  • NIC: NVIDIA® ConnectX®-5 and above.

  • Protocol: Ethernet.

  • Set VMA_RING_DEV_MEM_TX environment variable to best suit the application's requirements

Verifying On-Device Memory Capability in the Hardware

To verify “On Device Memory” capability in the hardware, run VMA with DEBUG trace level:


VMA_TRACELEVEL=DEBUG LD_PRELOAD=<path to libvma.so> <command line>

Look in the printout for a positive value of on-device-memory bytes.

For example:


Pid: 30089 Tid: 30089 VMA DEBUG: ibch[0xed61d0]:245:print_val() mlx5_0: port(s): 1 vendor: 4121 fw: 16.23.0258 max_qp_wr: 32768 on_device_memory: 131072

To show and monitor On-Device Memory statistics, run the vma_stats tool:

vma_stats -p <pid> -v 3

For example:

======================================================
RING_ETH=[0]
Tx Offload: 858931 / 3402875 [kilobytes/packets]
Rx Offload: 865251 / 3402874 [kilobytes/packets]
Dev Mem Alloc: 16384
Dev Mem Stats: 739074 / 1784935 / 0 [kilobytes/packets/oob]
======================================================


TCP_QUICKACK Threshold

Warning

To enable the TCP_QUICKACK threshold, modify the TCP_QUICKACK_THRESHOLD parameter in the lwip/opt.h file and recompile VMA.

While the TCP_QUICKACK option is enabled, TCP acknowledgments are sent immediately, rather than being delayed as in normal TCP receive operation. However, sending the acknowledgment delays processing of the incoming packet until the acknowledgment has been completed, which can affect performance.

The TCP_QUICKACK threshold enables the user to disable the quick acknowledgment for payloads that are larger than the threshold. The threshold is effective only when TCP_QUICKACK is enabled, using setsockopt() or the VMA_TCP_QUICKACK parameter. The TCP_QUICKACK threshold is disabled by default.

NetVSC

The network virtual service client (NetVSC) exposes a virtualized view of the physical network adapter on the guest operating system. NetVSC can be configured to connect to a Virtual Function (VF) of a physical network adapter that supports an SR-IOV interface.

VMA can offload NetVSC traffic through the SR-IOV interface only if the SR-IOV interface is available during application initialization.

While the SR-IOV interface is detached, VMA can redirect/forward ingress/egress packets to/from the NetVSC. This is done using a dedicated TAP device for each NetVSC, in addition to traffic control rules.

VMA can detect plug-in and plug-out events during runtime and route the traffic according to the event type.

Prerequisites

  • HCAs: NVIDIA® ConnectX®-5

  • Operating systems:

    • Ubuntu 16.04, kernel 4.15.0-1015-azure

    • Ubuntu 18.04, kernel 4.15.0-1015-azure

    • RHEL 7.5, kernel 3.10.0-862.9.1.el7

  • MLNX_OFED/Inbox driver: 4.5-x.x.x.x and above

  • WinOF: v5.60 and above, WinOF-2: v2.10 and above

  • Protocol: Ethernet

  • Root/Net cap admin permissions

  • VMA daemon enabled

VMA Daemon Design

The VMA daemon is responsible for managing all traffic control logic of all VMA processes, including qdisc creation, u32 table hashing, adding and removing filters, and removing filters when an application crashes.

For VMA daemon usage instructions, refer to the Installing the VMA Binary Package section in the Installation Guide.

For VMA daemon troubleshooting, see the Troubleshooting section.

TAP Statistics

To show and monitor TAP statistics, run the vma_stats tool:

vma_stats -p <pid> -v 3

Example:

======================================================
RING_TAP=[0]
Master: 0x29e4260
Tx Offload: 4463 / 67209 [kilobytes/packets]
Rx Offload: 5977 / 90013 [kilobytes/packets]
Rx Buffers: 256
VF Plugouts: 1
Tap fd: 21
Tap Device: td34f15
======================================================
RING_ETH=[1]
Master: 0x29e4260
Tx Offload: 7527 / 113349 [kilobytes/packets]
Rx Offload: 7527 / 113349 [kilobytes/packets]
Retransmissions: 1
======================================================

Output analysis:

  • RING_TAP[0] and RING_ETH[1] have the same bond master 0x29e4260

  • 4463 Kbytes/67209 packets were sent from the TAP device

  • 5977 Kbytes/90013 packets were received from the TAP device

  • Plugout event occurred once

  • TAP device fd number was 21, TAP name was td34f15

© Copyright 2023, NVIDIA. Last updated on Sep 8, 2023.