VMA Extra API

The information in this chapter is intended for application developers who want to use VMA’s Extra API to maximize performance with VMA:

  • To further lower latencies

  • To increase throughput

  • To gain additional CPU cycles for the application logic

  • To better control VMA offload capabilities

Standard socket applications are limited to the functions of the Socket API. The VMA Extra API exposes an additional set of functions that let the application developer use zero-copy receive calls and low-level packet filtering, inspecting the headers or payload of incoming packets at a very early stage of processing.

VMA is designed as a dynamically-linked user-space library. As such, the VMA Extra API has been designed to allow the user to dynamically load VMA and to detect at runtime if the additional functionality described here is available or not. The application is still able to run over the general socket library without VMA loaded as it did previously, or can use an application flag to decide which API to use: Socket API or VMA Extra API.

The VMA Extra API is provided as a header file with the VMA binary rpm. The application developer needs to include this header file in the application code.

After installing the VMA rpm on the target host, the VMA Extra API header file can be included from the following path:


#include "/usr/include/mellanox/vma_extra.h"

The vma_extra.h header provides detailed information about the various functions and structures, and instructions on how to use them.

An example using the VMA Extra API can be seen in the udp_lat source code:

  • Follow the ‘--vmarxfiltercb’ flag for the packet filter logic

  • Follow the ‘--vmazcopyread’ flag for the zero copy recvfrom logic

A specific example for using the TCP zero copy extra API can be seen under extra_api_tests/tcp_zcopy_cb.

During runtime, use the vma_get_api() function to check if VMA is loaded in your application, and if the VMA Extra API is accessible.

If the function returns NULL, either VMA is not loaded with the application, or the VMA Extra API is not compatible with the header file used for compiling your application. NULL is the typical return value when running the application on the native OS without VMA loaded.

Any non-NULL return value is a vma_api_t type structure pointer that holds pointers to the specific VMA Extra API function calls which are needed for the application to use.

It is recommended to call vma_get_api() once on startup, and to use the returned pointer throughout the life of the process.

There is no need to ‘release’ this pointer in any way.
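The start-up check described above can be sketched as follows. To keep the sketch compilable on a host without VMA, `struct vma_api_t` and `vma_get_api()` are local stand-ins here (assumptions) that replace the real definitions from vma_extra.h, and `rx_mode_t`, `cached_api()`, and `select_rx_mode()` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the real type from vma_extra.h (assumption: only one
 * member is shown; the real structure holds many function pointers). */
struct vma_api_t {
    int (*dump_fd_stats)(int fd, int log_level);
};

/* Stand-in for the real vma_get_api(). It returns NULL, which is what
 * the text says happens when VMA is not loaded. */
static struct vma_api_t *vma_get_api(void)
{
    return NULL;
}

typedef enum { RX_MODE_SOCKET_API, RX_MODE_VMA_EXTRA } rx_mode_t;

/* Query the API once and reuse the cached pointer afterwards.
 * Not thread-safe: intended to be first called during start-up. */
static struct vma_api_t *cached_api(void)
{
    static int queried = 0;
    static struct vma_api_t *api = NULL;

    if (!queried) {
        api = vma_get_api();
        queried = 1;
    }
    return api;
}

/* Use the Extra API only if the application asked for it (for example
 * through a command-line flag) and the pointer is non-NULL. */
static rx_mode_t select_rx_mode(int want_extra_api)
{
    assert(want_extra_api == 0 || want_extra_api == 1);
    return (want_extra_api && cached_api() != NULL) ? RX_MODE_VMA_EXTRA
                                                    : RX_MODE_SOCKET_API;
}
```

With the stand-in in place, `select_rx_mode(1)` falls back to the plain Socket API, which mirrors running the application on the native OS without VMA loaded.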

Adding libvma.conf Rules During Run-Time

Adds a libvma.conf rule to the top of the list. The rule does not apply to existing sockets that have already evaluated the configuration rules (this happens around connect/listen/send/recv calls).

Syntax:


int (*add_conf_rule)(char *config_line);

Return value:

  • 0 on success

  • error code on failure

Where:

  • config_line

    • Description – new rule to add to the top of the list (highest priority)

    • Value – a char buffer with the exact format as defined in libvma.conf, and should end with '\0'

Creating Sockets as Offloaded or Not-Offloaded

Sets whether sockets subsequently created on the pthread tid are offloaded or not offloaded. This does not affect existing sockets. Offloaded sockets are still subject to libvma.conf rules.

Usually combined with the VMA_OFFLOADED_SOCKETS parameter.

Syntax:


int (*thread_offload)(int offload, pthread_t tid);

Return value:

  • 0 on success

  • error code on failure

Where:

  • offload

    • Description – Offload property

    • Value – 1 for offloaded, 0 for not-offloaded

  • tid

    • Description – thread ID

The packet filter logic gives the application developer the capability to inspect a received packet. You can then decide, on the fly, to keep or drop the received packet at this stage in processing.

The user’s application packet filtering callback is defined by the prototype:


typedef vma_recv_callback_retval_t (*vma_recv_callback_t) (int fd, size_t sz_iov, struct iovec iov[], struct vma_info_t* vma_info, void *context);

This callback function should be registered with VMA by calling the VMA Extra API function register_recv_callback(). It can be unregistered by setting a NULL function pointer.

VMA calls the callback to notify of new incoming packets after the internal IP & UDP/TCP header processing, and before they are queued in the socket's receive queue.

The context of the callback is always that of one of the user's application threads that called one of the following socket APIs: select(), poll(), epoll_wait(), recv(), recvfrom(), recvmsg(), read(), or readv().

Packet filtering callback function parameters:

  • fd – file descriptor of the socket to which this packet refers

  • sz_iov – size of the iov array

  • iov – iovec structure array holding the received packet: data buffer pointers and the size of each buffer

  • vma_info – additional information on the packet and socket

  • context – user-defined value provided during callback registration for each socket

Warning

The application can call all the Socket APIs from within the callback context.

Packet loss might occur depending on the application's behavior in the callback context. A quick, non-blocking callback is not expected to induce packet loss.

The "iov" and "vma_info" parameters are only valid until the callback returns to VMA. Copy these structures if they are needed later, for example when working with zero copy logic.
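A filtering callback of the kind described above might look like the sketch below. It assumes a hypothetical application protocol in which wanted datagrams start with the bytes "MD". The return-value enumerators and the opaque `struct vma_info_t` are local stand-ins (the enumerator names are an assumption and should be checked against the installed vma_extra.h), and `prefix_filter_cb` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Stand-ins for declarations that normally come from vma_extra.h.
 * The enumerator names are an assumption; check the installed header. */
typedef enum {
    VMA_PACKET_DROP,
    VMA_PACKET_RECV,
    VMA_PACKET_HOLD
} vma_recv_callback_retval_t;

struct vma_info_t;  /* opaque here; this filter does not inspect it */

/* Hypothetical application protocol: wanted datagrams start with "MD". */
static const unsigned char WANTED_PREFIX[2] = { 'M', 'D' };

/* Same parameter list as vma_recv_callback_t. Keeps packets whose first
 * buffer starts with WANTED_PREFIX and drops everything else. The
 * context is used as an optional counter of dropped packets. */
static vma_recv_callback_retval_t
prefix_filter_cb(int fd, size_t sz_iov, struct iovec iov[],
                 struct vma_info_t *vma_info, void *context)
{
    unsigned long *dropped = context;   /* may be NULL */

    (void)fd;
    (void)vma_info;
    assert(sz_iov == 0 || iov != NULL);

    if (sz_iov > 0 && iov[0].iov_len >= sizeof(WANTED_PREFIX) &&
        memcmp(iov[0].iov_base, WANTED_PREFIX, sizeof(WANTED_PREFIX)) == 0)
        return VMA_PACKET_RECV;

    if (dropped != NULL)
        ++*dropped;
    return VMA_PACKET_DROP;
}
```

The callback only reads the first buffer and never blocks, which keeps it within the "quick, non-blocking" behavior recommended in the warning. It would be registered per socket through register_recv_callback(), with the counter passed as the context value.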

Zero Copy recvfrom()

Zero-copy recvfrom implementation. This function attempts to receive a packet without copying the data.

Syntax:


int (*recvfrom_zcopy)(int s, void *buf, size_t len, int *flags, struct sockaddr *from, socklen_t *fromlen);

Where:

  • s – socket file descriptor

  • buf – buffer to fill with received data or pointers to data (see below)

  • flags – pointer to flags (see below); accepts the usual flags to recvmsg(), and MSG_VMA_ZCOPY_FORCE

  • from – if not NULL, is set to the source address (same as recvfrom)

  • fromlen – if not NULL, is set to the source address size (same as recvfrom)

The flags parameter can contain the usual flags to recvmsg(), and also the MSG_VMA_ZCOPY_FORCE flag. If the latter is not set, the function falls back to data copy when zero copy cannot be performed. If zero copy is performed, the MSG_VMA_ZCOPY flag is set upon exit.

If zero copy is performed (the MSG_VMA_ZCOPY flag is returned), the buffer is filled with a vma_packets_t structure holding as many fragments as `len` allows, and the total size of all fragments is returned. Otherwise, the buffer is filled with actual data, and its size is returned (same as recvfrom()).

A positive return value therefore means that data has been received, either by zero copy or by data copy, depending on the MSG_VMA_ZCOPY flag. If the return value is zero, no data has been received.
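The return-value and flag rules above can be condensed into a small helper. This is only a sketch: the two flag values are stand-ins (assumptions) used when vma_extra.h is not included, and `rx_result_t`, `request_flags()`, and `classify_rx()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in values (assumptions) so the sketch compiles without
 * vma_extra.h; the real constants come from that header. */
#ifndef MSG_VMA_ZCOPY_FORCE
#define MSG_VMA_ZCOPY_FORCE 0x01000000
#endif
#ifndef MSG_VMA_ZCOPY
#define MSG_VMA_ZCOPY       0x00040000
#endif

typedef enum { RX_ERROR, RX_NO_DATA, RX_ZERO_COPY, RX_DATA_COPY } rx_result_t;

/* Flags to store in *flags before calling recvfrom_zcopy(). */
static int request_flags(int force_zero_copy)
{
    return force_zero_copy ? MSG_VMA_ZCOPY_FORCE : 0;
}

/* Interpret the return value together with the flags as they are left
 * in *flags after the call. */
static rx_result_t classify_rx(int ret, const int *flags)
{
    assert(flags != NULL);
    if (ret < 0)
        return RX_ERROR;        /* treated like a failed recvfrom() */
    if (ret == 0)
        return RX_NO_DATA;
    /* Positive size: buf holds either a vma_packets_t structure
     * (zero copy) or the payload itself (data copy). */
    return (*flags & MSG_VMA_ZCOPY) ? RX_ZERO_COPY : RX_DATA_COPY;
}
```

Only in the RX_ZERO_COPY case does the buffer describe packets that must later be released with free_packets().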

Freeing Zero Copied Packet Buffers

Frees a packet received by "recvfrom_zcopy()" or held by "receive callback".

Syntax:


int (*free_packets)(int s, struct vma_packet_t *pkts, size_t count);

Where:

  • s – socket from which the packet was received

  • pkts – array of packet identifiers

  • count – number of packets in the array

Return value:

  • 0 on success, -1 on failure

  • errno is set to:

    • EINVAL – not a VMA offloaded socket

    • ENOENT – the packet was not received from 's'.

Example:


entry  Source       Source-mask    Dest  Dest-mask  Interface  Service  Routing    Status  Log
|------|------------|---------------|-----|----------|-
1      any          any            any   any        if0        any      tunneling  active  1
2      192.168.2.0  255.255.255.0  any   any        if1        any      tunneling  active  1

Expected result:


sRB-20210G-61f0(statistic)# log show
counter  tx total pack  tx total byte  rx total pack  rx total byte
|------|-------------|-------------|-------------|--------------
1        2733553        268066596      3698           362404

  • tx total byte – the number of transmit bytes (from InfiniBand to Ethernet) associated with a TFM rule that has a log counter n. The above example shows the number of bytes sent from InfiniBand to Ethernet (one way), or sent between InfiniBand and Ethernet, matching the two TFM rules with log counter #1.

  • rx total pack – the number of receive packets (from Ethernet to InfiniBand) associated with a TFM rule that has a log counter n

  • rx total byte – the number of receive bytes (from Ethernet to InfiniBand) associated with a TFM rule that has a log counter n

Dumps statistics for the given fd number using the log_level log level.

Syntax:


int (*dump_fd_stats) (int fd, int log_level);

Parameters:

  • fd – fd to dump, 0 for all open fds

  • log_level – dumping level, corresponding to the vlog_levels_t enum (vlogger.h):

    • VLOG_NONE = -1

    • VLOG_PANIC = 0

    • VLOG_ERROR = 1

    • VLOG_WARNING = 2

    • VLOG_INFO = 3

    • VLOG_DETAILS = 4

    • VLOG_DEBUG = 5

    • VLOG_FUNC = VLOG_FINE = 6

    • VLOG_FUNC_ALL = VLOG_FINER = 7

    • VLOG_ALL = 8

For an output example, see section Monitoring – the vma_stats Utility.

Return values: 0 on success, -1 on failure
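An application that takes the dump level from a configuration string could map level names to the numeric values listed above before calling dump_fd_stats(). The sketch below keeps its own copy of the name/value pairs (the authoritative definition is the vlog_levels_t enum in vlogger.h); `VLOG_TABLE` and `vlog_level_by_name()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Names and values copied from the list above; the authoritative
 * definition is the vlog_levels_t enum in vlogger.h. */
static const struct { const char *name; int level; } VLOG_TABLE[] = {
    { "VLOG_NONE", -1 },    { "VLOG_PANIC", 0 },  { "VLOG_ERROR", 1 },
    { "VLOG_WARNING", 2 },  { "VLOG_INFO", 3 },   { "VLOG_DETAILS", 4 },
    { "VLOG_DEBUG", 5 },    { "VLOG_FUNC", 6 },   { "VLOG_FINE", 6 },
    { "VLOG_FUNC_ALL", 7 }, { "VLOG_FINER", 7 },  { "VLOG_ALL", 8 },
};

/* Look up a level by name. Returns 0 and stores the value in *level,
 * or returns -1 if the name is unknown. An out-parameter is used
 * because -1 is itself a valid level (VLOG_NONE). */
static int vlog_level_by_name(const char *name, int *level)
{
    size_t i;

    assert(name != NULL && level != NULL);
    for (i = 0; i < sizeof(VLOG_TABLE) / sizeof(VLOG_TABLE[0]); i++) {
        if (strcmp(VLOG_TABLE[i].name, name) == 0) {
            *level = VLOG_TABLE[i].level;
            return 0;
        }
    }
    return -1;
}
```

The resulting value would be passed as the log_level argument of dump_fd_stats().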

The “Dummy Send” feature gives the application developer the capability to send dummy packets in order to warm up the CPU caches on the VMA send path, hence minimizing cache misses and improving latency. The dummy packets reach the hardware NIC and are then dropped.

The application developer is responsible for sending the dummy packets by setting the VMA_SND_FLAGS_DUMMY bit in the flags parameter of the send(), sendto(), sendmsg(), and sendmmsg() socket APIs.

Parameters:

  • VMA_SND_FLAGS_DUMMY – indicates a dummy packet

Return values: Same as the original APIs for offloaded sockets. Otherwise, -1 is returned and errno is set to EINVAL.

Usage example:


void dummyWait(Timer waitDuration, Timer dummySendCycleDuration)
{
    Timer now = Timer::now();
    Timer endTime = now + waitDuration;
    Timer nextDummySendTime = now + dummySendCycleDuration;

    for ( ; now < endTime; now = Timer::now()) {
        if (now >= nextDummySendTime) {
            send(fd, buf, len, VMA_SND_FLAGS_DUMMY);
            nextDummySendTime += dummySendCycleDuration;
        }
    }
}

This sample code sends a dummy packet every dummySendCycleDuration using the VMA Extra API, for as long as the total time does not exceed waitDuration.

Warning

It is recommended not to send more than 50k dummy packets per second.
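The 50k packets per second recommendation corresponds to a minimum gap of 1,000,000,000 ns / 50,000 = 20,000 ns between dummy sends. A pacing helper that never goes below that gap could be sketched as follows; the function names and the nanosecond clock values passed in are hypothetical, and the caller is assumed to supply a monotonic timestamp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DUMMY_PER_SEC 50000ULL
#define NSEC_PER_SEC      1000000000ULL

/* Smallest gap between dummy sends that respects the recommendation. */
static uint64_t min_dummy_interval_ns(void)
{
    return NSEC_PER_SEC / MAX_DUMMY_PER_SEC;
}

/* Returns 1 if a dummy send is due at now_ns and schedules the next
 * one in *next_ns; returns 0 otherwise. Intervals shorter than the
 * recommended minimum are clamped. The next send is scheduled from
 * now_ns, so a stalled loop does not produce a burst afterwards. */
static int dummy_send_due(uint64_t now_ns, uint64_t *next_ns,
                          uint64_t interval_ns)
{
    assert(next_ns != NULL);
    if (interval_ns < min_dummy_interval_ns())
        interval_ns = min_dummy_interval_ns();
    if (now_ns < *next_ns)
        return 0;
    *next_ns = now_ns + interval_ns;
    return 1;
}
```

When `dummy_send_due()` returns 1, the application would issue its send() call with the VMA_SND_FLAGS_DUMMY flag set.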

Verifying “Dummy Send” capability in HW

The “Dummy Send” feature is supported in hardware starting from the ConnectX-4 NIC.

To verify the “Dummy Send” capability in the hardware, run VMA with the DEBUG trace level.


VMA_TRACELEVEL=DEBUG LD_PRELOAD=<path to libvma.so> <command line>

Look in the printout for “HW Dummy send support for QP = [0|1]”.

For example:


Pid: 3832 Tid: 3832 VMA DEBUG: qpm[0x2097310]:121:configure() Creating QP of transport type 'ETH' on ibv device 'mlx5_0' [0x201e460] on port 1
Pid: 3832 Tid: 3832 VMA DEBUG: qpm[0x2097310]:137:configure() HW Dummy send support for QP = 1
Pid: 3832 Tid: 3832 VMA DEBUG: cqm[0x203a460]:269:cq_mgr() Created CQ as Tx with fd[25] and of size 3000 elements (ibv_cq_hndl=0x20a0000)

“Dummy Packets” Statistics

Run the vma_stats tool to view the total number of dummy packets sent.


vma_stats -p <pid> -v 3

The number of dummy messages sent will appear under the relevant fd. For example:


======================================================
Fd=[20]
- UDP, Blocked, MC Loop Enabled
- Local Address = [0.0.0.0:56732]
Tx Offload: 128 / 9413 / 0 / 0 [kilobytes/packets/drops/errors]
Tx Dummy messages : 87798
Rx Offload: 128 / 9413 / 0 / 0 [kilobytes/packets/eagains/errors]
Rx byte: cur 0 / max 14 / dropped 0 / limit 212992
Rx pkt : cur 0 / max 1 / dropped 0
Rx poll: 0 / 9411 (100.00%) [miss/hit]
======================================================

Warning

Starting from VMA v8.5.x, the VMA_POLL parameter is renamed to SocketXtreme.

The API introduced for this capability allows an application to remove the overhead of the socket API from the receive data path, while keeping the well-known socket API for the control interface. With this functionality, the application has almost direct access to VMA’s HW ring object, and it is possible to implement a design that does not call socket APIs such as select(), poll(), epoll_wait(), recv(), recvfrom(), recvmsg(), read(), or readv().

The structures and constants are defined as shown below.

VMA Specific Events


typedef enum {
    VMA_SOCKETXTREME_PACKET = (1ULL << 32),
    VMA_SOCKETXTREME_NEW_CONNECTION_ACCEPTED = (1ULL << 33)
} vma_socketxtreme_events_t;

  • VMA_SOCKETXTREME_PACKET – new packet is available

  • VMA_SOCKETXTREME_NEW_CONNECTION_ACCEPTED – new connection is auto accepted by server
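Both VMA-specific events use bits 32 and 33, above the 32-bit range that epoll event flags occupy (the events member of struct epoll_event is a uint32_t), so one 64-bit mask can carry both kinds. A sketch of splitting such a mask is shown below. The `SX_` macros are local copies of the two values above, defined as macros so the sketch does not depend on vma_extra.h, and `struct sx_events` and the two functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Local copies of the two VMA-specific event bits shown above. */
#define SX_PACKET                  (1ULL << 32)
#define SX_NEW_CONNECTION_ACCEPTED (1ULL << 33)

struct sx_events {
    int packet;             /* 1 if a new packet is available */
    int new_connection;     /* 1 if a connection was auto accepted */
    uint32_t epoll_events;  /* standard EPOLL* bits from the low 32 bits */
};

/* Split a 64-bit events mask into its VMA-specific and epoll parts. */
static void sx_split_events(uint64_t events, struct sx_events *out)
{
    assert(out != NULL);
    out->packet = (events & SX_PACKET) != 0;
    out->new_connection = (events & SX_NEW_CONNECTION_ACCEPTED) != 0;
    out->epoll_events = (uint32_t)(events & 0xFFFFFFFFULL);
}

/* Example of testing one of the standard epoll bits afterwards. */
static int sx_is_writable(const struct sx_events *e)
{
    assert(e != NULL);
    return (e->epoll_events & (uint32_t)EPOLLOUT) != 0;
}
```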

VMA Buffer


struct vma_buff_t {
    struct vma_buff_t* next;
    void* payload;
    uint16_t len;
};

  • next – next buffer (for the last buffer, next == NULL)

  • payload – pointer to data

  • len – data length

VMA Packet

struct vma_packet_desc_t {
    size_t num_bufs;
    uint16_t total_len;
    struct vma_buff_t* buff_lst;
};

  • num_bufs – number of buffers in the packet

  • total_len – total data length

  • buff_lst – list of the packet's buffers
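The relationship between a packet descriptor and its buffer chain can be illustrated by walking buff_lst. The sketch below copies a scattered packet into one contiguous buffer, which gives up the zero-copy benefit and is shown only to make the traversal concrete. The two structures are local copies of the definitions above so the sketch compiles without vma_extra.h, and `gather_packet()` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copies of the structures shown above. */
struct vma_buff_t {
    struct vma_buff_t *next;
    void *payload;
    uint16_t len;
};

struct vma_packet_desc_t {
    size_t num_bufs;
    uint16_t total_len;
    struct vma_buff_t *buff_lst;
};

/* Copy all buffers of a packet into dst, in list order.
 * Returns the number of bytes copied, or -1 if dst is too small or the
 * chain does not match num_bufs/total_len. */
static long gather_packet(const struct vma_packet_desc_t *pkt,
                          void *dst, size_t dst_len)
{
    const struct vma_buff_t *b;
    size_t off = 0, n = 0;

    assert(pkt != NULL && dst != NULL);
    if (pkt->total_len > dst_len)
        return -1;
    for (b = pkt->buff_lst; b != NULL; b = b->next, n++) {
        if (off + b->len > pkt->total_len)
            return -1;              /* chain longer than advertised */
        memcpy((char *)dst + off, b->payload, b->len);
        off += b->len;
    }
    return (n == pkt->num_bufs && off == pkt->total_len) ? (long)off : -1;
}
```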


struct vma_completion_t {
    struct vma_packet_desc_t packet;
    uint64_t events;
    uint64_t user_data;
    struct sockaddr_in src;
    int listen_fd;
};

  • events – set of events

  • user_data – user-provided data

    • By default, this field holds the FD of the socket

    • The user can change the content using setsockopt() with the level argument SOL_SOCKET and optname SO_VMA_USER_DATA

  • src – source address (in network byte order), set for VMA_SOCKETXTREME_PACKET and VMA_SOCKETXTREME_NEW_CONNECTION_ACCEPTED events

  • listen_fd – the connected socket's parent/listen socket fd number. Valid when the VMA_SOCKETXTREME_NEW_CONNECTION_ACCEPTED event is set.

Polling for VMA Completions

Syntax:


int (*socketxtreme_poll)(int fd, struct vma_completion_t* completions, unsigned int ncompletions, int flags);

Where:

  • fd – file descriptor

  • completions – array of completion elements

  • ncompletions – number of elements in the passed array

  • flags – flags to control behavior (set to zero)

Return values: Returns the number of ready completions on success. A negative value is returned in case of failure.

Description: This function polls the `fd` for VMA completions and returns at most `ncompletions` ready completions via the `completions` array. The `fd` represents a ring file descriptor. VMA completions are indicated for incoming packets and/or for other events.

If the VMA_SOCKETXTREME_PACKET flag is enabled in the vma_completion_t.events field, the completion points to the incoming packet descriptor, which can be accessed via the vma_completion_t.packet field. The packet descriptor points to the VMA buffers that contain data scattered by HW, so the data is delivered to the application with zero copy. Notice: after the application is finished with the returned packets and their buffers, it must free them using the socketxtreme_free_vma_packets()/socketxtreme_free_vma_buff() functions. If the VMA_SOCKETXTREME_PACKET flag is disabled, the vma_completion_t.packet field is reserved.

In addition to the packet arrival event (indicated by the VMA_SOCKETXTREME_PACKET flag), VMA also reports the VMA_SOCKETXTREME_NEW_CONNECTION_ACCEPTED event and standard epoll events via the vma_completion_t.events field. The VMA_SOCKETXTREME_NEW_CONNECTION_ACCEPTED event is reported when a new connection is accepted by the server. When working with socketxtreme_poll(), new connections are accepted automatically and accept(listen_socket) must not be called. The VMA_SOCKETXTREME_NEW_CONNECTION_ACCEPTED event is reported for the new connected/child socket (vma_completion_t.user_data refers to the child socket), and the EPOLLIN event is not generated for the listen socket.

For events other than packet arrival and new connection acceptance, the vma_completion_t.events bitmask is composed using standard epoll API event types. Notice: the same completion can report multiple events; for example, the VMA_SOCKETXTREME_PACKET flag can be enabled together with the EPOLLOUT event.
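The per-completion handling described above, as it might appear inside a polling loop, is sketched below for a single completion. The structures are local copies of the ones defined earlier, the event bits are redefined as `SX_` macros so the sketch compiles without vma_extra.h, and `struct rx_stats` and `handle_completion()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <netinet/in.h>

/* Local copies of the definitions shown earlier in this chapter. */
#define SX_PACKET                  (1ULL << 32)
#define SX_NEW_CONNECTION_ACCEPTED (1ULL << 33)

struct vma_buff_t { struct vma_buff_t *next; void *payload; uint16_t len; };
struct vma_packet_desc_t {
    size_t num_bufs;
    uint16_t total_len;
    struct vma_buff_t *buff_lst;
};
struct vma_completion_t {
    struct vma_packet_desc_t packet;
    uint64_t events;
    uint64_t user_data;
    struct sockaddr_in src;
    int listen_fd;
};

struct rx_stats {
    unsigned long packets;
    unsigned long bytes;
    unsigned long connections;
    int last_child_fd;
};

/* Process one completion returned by socketxtreme_poll().
 * Returns 1 if c->packet is valid and must later be released with
 * socketxtreme_free_vma_packets(), 0 otherwise. */
static int handle_completion(const struct vma_completion_t *c,
                             struct rx_stats *st)
{
    int must_free = 0;

    assert(c != NULL && st != NULL);
    if (c->events & SX_NEW_CONNECTION_ACCEPTED) {
        /* Reported for the child socket; by default user_data is its fd. */
        st->connections++;
        st->last_child_fd = (int)c->user_data;
    }
    if (c->events & SX_PACKET) {
        /* The packet field is only meaningful when this bit is set. */
        st->packets++;
        st->bytes += c->packet.total_len;
        must_free = 1;
    }
    return must_free;
}
```

Because one completion can carry several events, the two checks are independent `if` statements rather than an `if`/`else` chain.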

Getting Number of Attached Rings

Syntax:


int (*get_socket_rings_num)(int fd);

Where:

  • fd – file descriptor

Return values: Returns the number of rings on success. A negative value is returned in case of failure.

Description: Returns the number of rings that are associated with the socket.

Getting Ring FD

Syntax:


int (*get_socket_rings_fds)(int fd, int *ring_fds, int ring_fds_sz);

Where:

  • fd – file descriptor

  • ring_fds – int array of ring fds

  • ring_fds_sz – size of the array

Return values: Returns the number of populated array entries on success. A negative value is returned in case of failure.

Description: Returns FDs of the rings that are associated with the socket.
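The two ring functions are naturally used together: first ask how many rings the socket has, then fetch that many descriptors. The sketch below shows this pattern under stated assumptions: `struct vma_api_t` is a stand-in that contains only the two members used here (the real structure comes from vma_extra.h), `collect_ring_fds()` is a hypothetical helper, and the two `fake_` functions exist only so the sketch can be exercised without VMA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in with only the two members used here. */
struct vma_api_t {
    int (*get_socket_rings_num)(int fd);
    int (*get_socket_rings_fds)(int fd, int *ring_fds, int ring_fds_sz);
};

/* Returns a malloc'ed array with the ring fds of socket `fd` and stores
 * its length in *count. Returns NULL (and *count == 0) on failure.
 * The caller releases the array with free(). */
static int *collect_ring_fds(const struct vma_api_t *api, int fd, int *count)
{
    int n, got;
    int *fds;

    assert(api != NULL && count != NULL);
    *count = 0;
    n = api->get_socket_rings_num(fd);
    if (n <= 0)
        return NULL;
    fds = malloc((size_t)n * sizeof(*fds));
    if (fds == NULL)
        return NULL;
    got = api->get_socket_rings_fds(fd, fds, n);
    if (got <= 0) {
        free(fds);
        return NULL;
    }
    *count = got;
    return fds;
}

/* Fake implementations, only for exercising the sketch without VMA. */
static int fake_rings_num(int fd)
{
    return fd >= 0 ? 2 : -1;
}

static int fake_rings_fds(int fd, int *ring_fds, int ring_fds_sz)
{
    int i;

    if (fd < 0)
        return -1;
    for (i = 0; i < ring_fds_sz && i < 2; i++)
        ring_fds[i] = 100 + i;
    return i;
}
```

Each returned descriptor could then be passed as the `fd` argument of socketxtreme_poll().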

Free VMA Packets

Syntax:


int (*socketxtreme_free_vma_packets)(struct vma_packet_desc_t *packets, int num);

Where:

  • packets – packets to be freed

  • num – number of packets in passed array

Return values: Returns zero on success. A negative value is returned in case of failure.

Description: Frees packets received by socketxtreme_poll().

For each packet in the `packets` array, this function updates the receive queue size and the advertised TCP window size, if needed, for the socket that received the packet, and frees the VMA buffer list that is associated with the packet. Notice: for each buffer in the buffer list, VMA decreases the buffer's reference count, and only buffers whose reference count reaches zero are deallocated. An application can call socketxtreme_ref_vma_buff() to increase the buffer reference count in order to hold the buffer even after socketxtreme_free_vma_packets() has been called. The application is then responsible for freeing buffers that could not be deallocated during socketxtreme_free_vma_packets() because of a non-zero reference count. This is done by calling the socketxtreme_free_vma_buff() function.

Decrement VMA Buffer Reference Counter

Syntax:


int (*socketxtreme_free_vma_buff)(struct vma_buff_t *buff);

Return values: Returns the buffer's reference count after the change (zero value means that the buffer has been deallocated). A negative value is returned in case of failure.

Description: Decrements the reference counter of a buffer received by socketxtreme_poll(). When the buffer's reference count reaches zero, the buffer is deallocated.

Increment VMA Buffer Reference Counter

Syntax:


int (*socketxtreme_ref_vma_buff)(struct vma_buff_t *buff);

Where:

  • buff – buffer to be managed

Return values: Returns buffer's reference count after the change. A negative value is returned in case of failure.

Description: Increments the reference counter of a buffer received by socketxtreme_poll(). This function should be used in order to hold the buffer even after a call to socketxtreme_free_vma_packets(). When the buffer is no longer required, it should be freed via socketxtreme_free_vma_buff().
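The interplay of the three functions can be followed with a toy model. This is not VMA code: `struct model_buff` and the `model_` functions are hypothetical, and the model assumes that every buffer delivered by socketxtreme_poll() starts with one reference owned by its packet.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a reference-counted buffer (not a VMA structure). */
struct model_buff {
    struct model_buff *next;
    int ref;            /* assumed to start at 1 when delivered */
    int deallocated;    /* set once the count reaches zero */
};

/* Models socketxtreme_ref_vma_buff(): returns the new count. */
static int model_ref_buff(struct model_buff *b)
{
    assert(b != NULL && !b->deallocated);
    return ++b->ref;
}

/* Models socketxtreme_free_vma_buff(): returns the new count, where
 * zero means the buffer has been deallocated. */
static int model_free_buff(struct model_buff *b)
{
    assert(b != NULL && !b->deallocated);
    if (b->ref <= 0)
        return -1;
    if (--b->ref == 0)
        b->deallocated = 1;
    return b->ref;
}

/* Models socketxtreme_free_vma_packets() for one packet: drops one
 * reference from every buffer in the list. */
static void model_free_packet(struct model_buff *lst)
{
    while (lst != NULL) {
        struct model_buff *next = lst->next;  /* read before release */
        model_free_buff(lst);
        lst = next;
    }
}
```

In this model, a buffer on which `model_ref_buff()` was called survives `model_free_packet()` with a count of one and is only deallocated by a later `model_free_buff()`, which is the hold-and-release sequence the descriptions above outline.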

Usage Example

The sockperf benchmark supports SocketXtreme mode. Its source code can be used as a reference for SocketXtreme API usage.

The following sample implements server side logic based on the API described above.

In this example, the application just waits for connection requests and accepts new connections.


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <mellanox/vma_extra.h>

int main(int argc, char **argv)
{
    int rc = 0;
    int fd = -1;
    struct sockaddr_in addr;
    static struct vma_api_t *_vma_api = NULL;
    static int _vma_ring_fd = -1;
    char *strdev = (argc > 1 ? argv[1] : NULL);
    char *straddr = (argc > 2 ? argv[2] : NULL);
    char *strport = (argc > 3 ? argv[3] : NULL);

    if (!strdev || !straddr || !strport) {
        printf("Wrong options\n");
        exit(1);
    }
    printf("Dev: %s\nAddress: %s\nPort:%s\n", strdev, straddr, strport);

    /* Get VMA extra API reference */
    _vma_api = vma_get_api();
    if (_vma_api == NULL) {
        printf("VMA Extra API not found\n");
        exit(1);
    }

    fd = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);

    rc = setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, (void *)strdev, strlen(strdev));
    if (rc < 0) {
        printf("setsockopt() failed %d : %s\n", errno, strerror(errno));
        exit(1);
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = inet_addr(straddr);
    addr.sin_port = htons(atoi(strport));

    rc = bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    if (rc < 0) {
        fprintf(stderr, "bind() failed %d : %s\n", errno, strerror(errno));
        exit(1);
    }

    /* RX ring is available after bind() */
    _vma_api->get_socket_rings_fds(fd, &_vma_ring_fd, 1);
    if (_vma_ring_fd == -1) {
        printf("Failed to return the ring fd\n");
        exit(1);
    }

    listen(fd, 5);
    printf("Waiting on: fd=%d\n", fd);

    while (0 == rc) {
        struct vma_completion_t vma_comps;

        /* Polling any RX events / data */
        rc = _vma_api->socketxtreme_poll(_vma_ring_fd, &vma_comps, 1, 0);
        if (rc > 0) {
            printf("socketxtreme_poll: rc=%d event=0x%lx user_data=%ld\n",
                   rc, (unsigned long)vma_comps.events, (long)vma_comps.user_data);
            if (vma_comps.events & VMA_SOCKETXTREME_NEW_CONNECTION_ACCEPTED) {
                printf("Accepted connection: fd=%d\n", (int)vma_comps.user_data);
                rc = 0; /* keep waiting for further connections */
            }
        }
    }

    close(fd);
    fprintf(stderr, "socket closed\n");

    return 0;
}

Installation

For instructions on how to install SocketXtreme, please refer to section Installing VMA with SocketXtreme.

Limitations

No support for:

  • Multi-thread

  • ConnectX-3/ConnectX-3 Pro HCAs

  • MLNX_OFED version lower than v3.4

Users should keep in mind the differences in flow between the standard socket API and the one based on the polling completions model.

  • SocketXtreme mode is used with non-blocking connect() calls only

  • Do not use accept(), because socketxtreme_poll() provides the special VMA_SOCKETXTREME_NEW_CONNECTION_ACCEPTED event to track connection requests

  • Mixed receive methods (recv/recvfrom/recvmsg/epoll_wait together with SocketXtreme) can cause the user to receive out-of-order packets. UDP is an unreliable protocol, hence working with mixed receive methods is allowed, yet not recommended. TCP is a reliable protocol, hence mixed receive methods are not allowed. socketxtreme_poll() is able to notify about any received data using the VMA_SOCKETXTREME_PACKET event.

© Copyright 2023, NVIDIA. Last updated on Sep 8, 2023.