XLIO Extra API

1.0

The information in this chapter is intended for application developers who want to maximize XLIO performance. By using the Extra API, they can achieve the following:

  • To further lower latencies

  • To increase throughput

  • To gain additional CPU cycles for the application logic

  • To better control XLIO offload capabilities

All socket applications are limited to the given Socket API interface functions.

The XLIO Extra API exposes an additional set of functions that allow the application developer to use zero-copy receive calls and low-level packet filtering, which inspects the incoming packet headers or payload at a very early stage of processing.

XLIO is designed as a dynamically linked user-space library. As such, the XLIO Extra API has been designed to allow the user to dynamically load XLIO and to detect at runtime if the additional functionality described here is available or not. The application is still able to run over the general socket library without XLIO loaded as it did previously, or can use an application flag to decide which API to use: Socket API or XLIO Extra API.

The XLIO Extra API is provided as a header file with the XLIO binary RPM. The application developer needs to include this header file in the application code.

After installing the XLIO RPM on the target host, the XLIO Extra API header file can be included from the following location:

#include "/usr/include/mellanox/xlio_extra.h"

The xlio_extra.h header provides detailed information about the various functions and structures, and instructions on how to use them.

An example using the XLIO Extra API can be seen in the udp_lat source code. Follow the ‘--xliozcopyread’ flag for the zero copy recvfrom logic.

A specific example for using the TCP zero copy extra API can be seen under extra_api_tests/tcp_zcopy_cb.

During runtime, use the xlio_get_api() function to check if XLIO is loaded in your application and if XLIO Extra API is accessible.

If the function returns NULL, either XLIO is not loaded with the application, or the XLIO Extra API is not compatible with the header file used for compiling your application. NULL is the typical return value when running the application on the native OS without XLIO loaded.

Any non-NULL return value is a pointer to an xlio_api_t structure that holds pointers to the specific XLIO Extra API functions the application needs. The available functions can be checked using the cap_mask bit-mask field.

It is recommended to call xlio_get_api() once on startup, and to use the returned pointer throughout the life of the process.

There is no need to ‘release’ this pointer in any way.

Adding libxlio.conf Rules During Run-Time

Adds a libxlio.conf rule to the top of the list. The rule does not apply to existing sockets that have already evaluated the configuration rules (this happens around connect(), listen(), send(), recv(), and similar calls).

Syntax:

int (*add_conf_rule)(char *config_line);

Return value:

  • 0 on success

  • error code on failure

Where:

  • config_line

    • Description – new rule to add to the top of the list (highest priority)

    • Value – a char buffer in the exact format defined in libxlio.conf, terminated with '\0'

Creating Sockets as Offloaded or Not-Offloaded

Creates sockets on pthread tid as off-loaded/not-off-loaded. This does not affect existing sockets. Offloaded sockets are still subject to libxlio.conf rules.

Usually combined with the XLIO_OFFLOADED_SOCKETS parameter.

Syntax:

int (*thread_offload)(int offload, pthread_t tid);

Return value:

  • 0 on success

  • error code on failure

Where:

  • offload

    • Description – Offload property

    • Value – 1 for offloaded, 0 for not-offloaded

  • tid

    • Description – thread ID

Zero Copy recvfrom()

Zero-copy recvfrom implementation. This function attempts to receive a packet without doing data copy.

Syntax:

int (*recvfrom_zcopy)(int s, void *buf, size_t len, int *flags, struct sockaddr *from, socklen_t *fromlen);

Where:

  • s – Socket file descriptor

  • buf – Buffer to fill with received data or pointers to data (see below)

  • flags – Pointer to flags (see below). Values: the usual flags to recvmsg(), and MSG_XLIO_ZCOPY_FORCE

  • from – If not NULL, is set to the source address (same as recvfrom())

  • fromlen – If not NULL, is set to the source address size (same as recvfrom())

The flags parameter can contain the usual flags to recvmsg(), and also the MSG_XLIO_ZCOPY_FORCE flag. If the latter is not set, the function falls back to data copy when zero-copy cannot be performed. If zero-copy is performed, the flag MSG_XLIO_ZCOPY is set upon exit.

If zero-copy is performed (MSG_XLIO_ZCOPY flag is returned), the buffer is filled with an xlio_recvfrom_zcopy_packets_t structure holding as many fragments as len allows. The total size of all fragments is returned. Otherwise, the buffer is filled with actual data, and its size is returned (same as recvfrom()).

If the return value is positive, data has been received, either zero-copied or copied as indicated by the MSG_XLIO_ZCOPY flag. If the return value is zero, no data has been received.

Freeing Zero Copied Packet Buffers

Frees a packet received by "recvfrom_zcopy()" or held by "receive callback".

Syntax:

int (*recvfrom_zcopy_free_packets)(int s, struct xlio_recvfrom_zcopy_packet_t *pkts , size_t count);

Where:

  • s – socket from which the packet was received

  • pkts – array of packet identifiers

  • count – number of packets in the array

Return value:

  • 0 on success, -1 on failure

  • errno is set to:

    • EINVAL – not an offloaded socket

    • ENOENT – the packet was not received from 's'.


The XLIO SocketXtreme API has been developed to optimize the data path of the socket API, while preserving the familiar standard socket API for control operations, such as select(), poll(), epoll_wait(), recv(), recvfrom(), recvmsg(), read(), and readv().

The XLIO SocketXtreme API enhances application performance in the following ways:

  1. Reduced Context Switching: The lightweight library call to socketxtreme_poll(...) eliminates the need for traditional methods like `poll()`, `select()`, `epoll()`, and similar interfaces, as well as subsequent read(), recv(), recvmsg(), and other socket-related system calls. It achieves asynchronous I/O for both data reception (RX) and data transmission (TX) by concurrently polling multiple sockets on the same ring (more information about rings is provided in subsequent sections).

  2. Comprehensive Data Handling: The socketxtreme_poll function provides the user application with detailed data in the form of the struct xlio_socketxtreme_completion_t. This structure, elaborated upon below, equips the application to effectively manage the status of sockets associated with the ring file descriptor and process the data received through these sockets.

Usage

Verify SocketXtreme support

To employ the XLIO extra API and the SocketXtreme interface, follow these steps:

1. Obtain the XLIO API using the `xlio_get_api` function.
2. Verify the `XLIO_MAGIC_NUMBER` to ensure compatibility.
3. Verify the XLIO capabilities, as demonstrated below.

#include <inttypes.h>
#include <stdio.h>
#include <mellanox/xlio_extra.h>

static struct xlio_api_t *get_xlio_api(void)
{
    /* xlio_get_api() returns NULL when XLIO is not loaded or is incompatible. */
    struct xlio_api_t *api_ptr = xlio_get_api();

    if (api_ptr == NULL) {
        return NULL;
    }

    if (api_ptr->magic != XLIO_MAGIC_NUMBER) {
        printf("Unexpected XLIO API magic number: expected %" PRIx64 ", got %" PRIx64 "\n",
               (uint64_t)XLIO_MAGIC_NUMBER, (uint64_t)api_ptr->magic);
        return NULL;
    }

    uint64_t required_caps = XLIO_EXTRA_API_GET_SOCKET_RINGS_FDS |
                             XLIO_EXTRA_API_SOCKETXTREME_POLL |
                             XLIO_EXTRA_API_SOCKETXTREME_FREE_PACKETS |
                             XLIO_EXTRA_API_IOCTL;

    if ((api_ptr->cap_mask & required_caps) != required_caps) {
        printf("Required XLIO caps are missing: required %" PRIx64 ", got %" PRIx64 "\n",
               required_caps, (uint64_t)api_ptr->cap_mask);
        return NULL;
    }

    /* The pointer is owned by XLIO and must not be freed by the application. */
    return api_ptr;
}


Replace the socket file descriptor with alternative identifier

XLIO introduces the capability to substitute a socket file descriptor with user-defined data. To utilize this feature, XLIO provides a new setsockopt option, as illustrated in the example below.

uint64_t user_data = (uintptr_t)user_ptr;
int rc = xlio_api->setsockopt(socket_fd, SOL_SOCKET, SO_XLIO_USER_DATA, &user_data, sizeof(user_data));
if (rc != 0) {
    printf("Failed to set socket user data for sock %d: rc %d, errno %d\n", socket_fd, rc, errno);
    goto fail;
}


Get ring file descriptor(s) from socket file descriptor

The ring file descriptor(s) will be required as arguments to socketxtreme_poll().

int ring_fds[2];
int num_rings;

num_rings = xlio_api->get_socket_rings_fds(socket_fd, ring_fds, 2);
if (num_rings < 0) {
    printf("Failed to get ring FDs for socket %d: rc %d, errno %d\n", socket_fd, num_rings, errno);
}

Where:

  • num_rings – The actual number of rings returned by the function. If the value is -1, the function failed; check errno.

  • ring_fds – An array of integers to be filled upon success

  • num_ring_fds – The maximal number of ring descriptors to fill


Poll ring fd

Poll each ring file descriptor obtained from the socket with socketxtreme_poll().

struct xlio_socketxtreme_completion_t comps[MAX_EVENTS_PER_POLL];
int num_events = xlio_api->socketxtreme_poll(ring_fd, comps, MAX_EVENTS_PER_POLL, SOCKETXTREME_POLL_TX);

Where:

  • num_events – The actual number of events returned by the function. If the value is -1, the function failed; check errno.

  • ring_fd – The ring file descriptor to poll

  • comps – An array of XLIO completion events


Processing the completions

For details of the xlio_socketxtreme_completion_t structure, please review the xlio_extra.h header file.

When the `XLIO_SOCKETXTREME_PACKET` flag is set in the `xlio_socketxtreme_completion_t.events` field, it indicates that the completion is associated with the descriptor of an incoming packet. You can access this descriptor through the `xlio_socketxtreme_completion_t.packet` field. This descriptor points to XLIO buffers that hold data distributed by the hardware, ensuring that the data is delivered to the application without the need for copying. After the application has finished using the returned packets and their associated buffers, they must be released using the `socketxtreme_free_packets()` and `socketxtreme_free_buff()` functions. If the `XLIO_SOCKETXTREME_PACKET` flag is not set, the `xlio_socketxtreme_completion_t.packet` field remains reserved.

In addition to indicating the arrival of a packet, XLIO also reports the `XLIO_SOCKETXTREME_NEW_CONNECTION_ACCEPTED` event and standard epoll events through the `xlio_socketxtreme_completion_t.events` field. The `XLIO_SOCKETXTREME_NEW_CONNECTION_ACCEPTED` event is triggered when a new connection is accepted by the server. When using `socketxtreme_poll()`, new connections are accepted automatically, and there is no need to explicitly call `accept()` on the listen socket. The `XLIO_SOCKETXTREME_NEW_CONNECTION_ACCEPTED` event pertains to the newly connected (child) socket, with `xlio_socketxtreme_completion_t.user_data` referring to the child socket. For events other than packet arrival and new connection acceptance, the `events` bitmask is constructed using standard epoll API event types. Note that the same completion can report multiple events; for instance, the `XLIO_SOCKETXTREME_PACKET` flag can be set alongside `EPOLLOUT` and other events.

for (int i = 0; i < num_events; i++) {
    struct xlio_socketxtreme_completion_t *comp = &comps[i];
    assert(comp->user_data != 0);

    /* Convert the user data to an application identifier and consume the data.
       Below we assume that the application converts the identifier to an
       app_sink pointer. */
    app_sink_t *app_sink = (app_sink_t *)comp->user_data;

    if (comp->events & EPOLLERR) {
        app_sink->process_error_cb();
    }

    if (comp->events & XLIO_SOCKETXTREME_PACKET) {
        app_sink->process_packet(comp->packet);
    }

    if (comp->events & XLIO_SOCKETXTREME_NEW_CONNECTION_ACCEPTED) {
        app_sink->process_accepted_connection(comp->packet);
    }
}


Freeing packets and buffers

In the following example, we iterate through the buffers of a specific packet, free them, and then release the packet.

static inline void free_xlio_packet(struct xlio_socketxtreme_packet_desc_t *packet)
{
    assert(packet != NULL);

    /* Free every buffer in the packet's buffer list. */
    while (packet->num_bufs--) {
        assert(packet->buff_lst != NULL);
        struct xlio_buff_t *buff_to_free = packet->buff_lst;
        packet->buff_lst = packet->buff_lst->next;
        xlio_api->socketxtreme_free_buff(buff_to_free);
    }

    /* Release the packet descriptor itself. */
    xlio_api->socketxtreme_free_packets(packet, 1);
}

The packet filter logic allows developers to inspect and dynamically decide whether to keep or drop incoming packets. The user's packet filtering callback follows this prototype:

typedef xlio_recv_callback_retval_t (*xlio_recv_callback_t) (int fd, size_t sz_iov, struct iovec iov[], struct xlio_info_t *xlio_info, void *context);

To register this callback function with XLIO, use the register_recv_callback() function provided by the XLIO Extra API. If you wish to unregister the callback, simply set the function pointer to NULL.

XLIO invokes this callback to inform the application about new incoming packets. This notification occurs after the internal processing of IP and UDP/TCP headers and precedes the queuing of these packets in the socket's receive queue.

The context of the callback is always that of one of the user's application threads that have previously called one of the following socket APIs: select(), poll(), epoll_wait(), recv(), recvfrom(), recvmsg(), read(), or readv().

Where:

  • fd – File descriptor of the socket to which this packet refers.

  • iov – iovector structure array pointer holding the packet received, data buffer pointers, and the size of each buffer.

  • sz_iov – Size of the iov array.

  • xlio_info – Additional information on the packet and socket.

  • context – User-defined value provided during callback registration for each socket.

Warning

The application is allowed to invoke all Socket APIs from within the callback context. However, it's important to note that depending on how the application behaves in this context, packet loss may occur. In cases where the callback behavior is extremely rapid and non-blocking, it's less likely to lead to packet loss.

Regarding the parameters "iov" and "xlio_info," it's crucial to understand that their validity is limited to the duration of the callback context. If you intend to work with zero-copy logic and require these structures for later use, it's advisable to make copies of them before exiting the callback context.

Dumps statistics for fd number using log_level log level.

Syntax:

int (*dump_fd_stats) (int fd, int log_level);

Parameters:

  • fd – fd to dump, 0 for all open fds.

  • log_level – dumping level, corresponding to the vlog_levels_t enum (vlogger.h):

    • VLOG_NONE = -1

    • VLOG_PANIC = 0

    • VLOG_ERROR = 1

    • VLOG_WARNING = 2

    • VLOG_INFO = 3

    • VLOG_DETAILS = 4

    • VLOG_DEBUG = 5

    • VLOG_FUNC = VLOG_FINE = 6

    • VLOG_FUNC_ALL = VLOG_FINER = 7

    • VLOG_ALL = 8

Return values: 0 on success, -1 on failure

For an output example, see section "Monitoring – the xlio_stats Utility".

The “Dummy Send” feature gives the application developer the capability to send dummy packets in order to warm up the CPU caches on the XLIO send path, hence minimizing cache misses and improving latency. A dummy packet reaches the hardware NIC and is then dropped.

The application developer is responsible for sending the dummy packets by setting the XLIO_SND_FLAGS_DUMMY bit in the flags parameter of send(), sendto(), sendmsg(), and sendmmsg() sockets API.

Parameters:

  • XLIO_SND_FLAGS_DUMMY – Indicates a dummy packet

Return values: Same as the original APIs for offloaded sockets. Otherwise, -1 is returned and errno is set to EINVAL.

Usage example:

void dummyWait(Timer waitDuration, Timer dummySendCycleDuration)
{
    Timer now = Timer::now();
    Timer endTime = now + waitDuration;
    Timer nextDummySendTime = now + dummySendCycleDuration;

    for ( ; now < endTime; now = Timer::now()) {
        if (now >= nextDummySendTime) {
            send(fd, buf, len, XLIO_SND_FLAGS_DUMMY);
            nextDummySendTime += dummySendCycleDuration;
        }
    }
}

This sample code sends a dummy packet every dummySendCycleDuration using the XLIO Extra API, for as long as the total elapsed time does not exceed waitDuration.

Warning

It is recommended not to send more than 50k dummy packets per second.

Verifying “Dummy Send” Capability in HW

In order to verify “Dummy Send” capability in the hardware, run XLIO with DEBUG trace level.

XLIO_TRACELEVEL=DEBUG LD_PRELOAD=<path to libxlio.so> <command line>

Look in the printout for “HW Dummy send support for QP = [0|1]”.

For example:

Pid: 3832 Tid: 3832 XLIO DEBUG: qpm[0x2097310]:121:configure() Creating QP of transport type 'ETH' on ibv device 'mlx5_0' [0x201e460] on port 1
Pid: 3832 Tid: 3832 XLIO DEBUG: qpm[0x2097310]:137:configure() HW Dummy send support for QP = 1
Pid: 3832 Tid: 3832 XLIO DEBUG: cqm[0x203a460]:269:cq_mgr() Created CQ as Tx with fd[25] and of size 3000 elements (ibv_cq_hndl=0x20a0000)


“Dummy Packets” Statistics

Run the xlio_stats tool to view the total number of dummy packets sent.

xlio_stats -p <pid> -v 3

The number of dummy messages sent will appear under the relevant fd. For example:

======================================================
Fd=[20]
- UDP, Blocked, MC Loop Enabled
- Local Address = [0.0.0.0:56732]
Tx Offload: 128 / 9413 / 0 / 0 [kilobytes/packets/drops/errors]
Tx Dummy messages : 87798
Rx Offload: 128 / 9413 / 0 / 0 [kilobytes/packets/eagains/errors]
Rx byte: cur 0 / max 14 / dropped 0 / limit 212992
Rx pkt : cur 0 / max 1 / dropped 0
Rx poll: 0 / 9411 (100.00%) [miss/hit]
======================================================


This function allows the application to communicate with the library using an extensible protocol based on struct cmsghdr.

Syntax:

int (*ioctl) (void *cmsg_hdr, size_t cmsg_len);

Parameters:

  • cmsg_hdr – The address of the ancillary data.

  • cmsg_len – The length of the ancillary data passed in cmsg_hdr. Note that if multiple ancillary data sections are being passed, this length should reflect the total length of all ancillary data sections.

The cmsg_hdr parameter points to the ancillary data. This cmsg_hdr pointer points to the following structure (C/C++ example shown) that describes the ancillary data.

struct cmsghdr {
    size_t   cmsg_len;       /* data byte count includes hdr */
    int      cmsg_level;     /* originating protocol         */
    int      cmsg_type;      /* protocol-specific type       */
    /* followed by u_char    cmsg_data[]; */
};

Ancillary data is a sequence of cmsghdr structures with appended data. The sequence of cmsghdr structures should never be accessed directly. Instead, use only the following macros: CMSG_ALIGN, CMSG_SPACE, CMSG_DATA, CMSG_LEN.

Guidelines:

  • The cmsg_len should be set to the length of the cmsghdr plus the length of all ancillary data that follows immediately after the cmsghdr. This is represented by the commented out cmsg_data field.

  • The cmsg_level should be set to the option level (for example, SOL_SOCKET).

  • The cmsg_type should be set to the option name (for example, CMSG_XLIO_IOCTL_USER_ALLOC).

Supported commands:

  • CMSG_XLIO_IOCTL_USER_ALLOC – Use user-defined functions to allocate global pools

CMSG_XLIO_IOCTL_USER_ALLOC data fields, listed by field size:

  • uint8_t – control flags

  • uintptr_t – pointer to memory allocation function

  • uintptr_t – pointer to memory free function

Control Flags

enum {
    IOCTL_USER_ALLOC_TX = (1 << 0),
    IOCTL_USER_ALLOC_RX = (1 << 1),
    IOCTL_USER_ALLOC_TX_ZC = (1 << 2)
};

Usage Example

In this example, the application uses CMSG_XLIO_IOCTL_USER_ALLOC command.

#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <netdb.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <arpa/inet.h>
#include <mellanox/xlio_extra.h>

/* Page-aligned allocator handed to XLIO. */
void *my_alloc(size_t sz_bytes)
{
    void *m_data_block = NULL;
    long page_size = sysconf(_SC_PAGESIZE);

    if (page_size > 0) {
        /* Round the size up to a whole number of pages. */
        sz_bytes = (sz_bytes + page_size - 1) & ~((size_t)page_size - 1);
        /* posix_memalign() returns 0 on success. */
        int ret = posix_memalign(&m_data_block, page_size, sz_bytes);
        if (ret != 0) {
            return NULL;
        }
    }
    return m_data_block;
}

void my_free(void *ptr)
{
    free(ptr);
}

int main(int argc, char *argv[])
{
    int sockfd = 0, n = 0;
    char recvBuff[1024];
    struct sockaddr_in serv_addr;

    if (argc != 2) {
        printf("\n Usage: %s <ip of server> \n", argv[0]);
        return 1;
    }

    /* Payload layout expected by CMSG_XLIO_IOCTL_USER_ALLOC (packed). */
#pragma pack(push, 1)
    struct {
        uint8_t flags;
        void *(*alloc_func)(size_t);
        void (*free_func)(void *);
    } data;
#pragma pack(pop)

    struct cmsghdr *cmsg;
    char cbuf[CMSG_SPACE(sizeof(data))];

    errno = 0;
    cmsg = (struct cmsghdr *)cbuf;
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = CMSG_XLIO_IOCTL_USER_ALLOC;
    cmsg->cmsg_len = CMSG_LEN(sizeof(data));
    data.flags = 0x03; /* IOCTL_USER_ALLOC_TX | IOCTL_USER_ALLOC_RX */
    data.alloc_func = my_alloc;
    data.free_func = my_free;
    memcpy(CMSG_DATA(cmsg), &data, sizeof(data));

    struct xlio_api_t *extra_api;
    extra_api = xlio_get_api();
    printf("extra_api=%p\n", (void *)extra_api);

    int rc = 0;
    if (extra_api) {
        rc = extra_api->ioctl(cmsg, cmsg->cmsg_len);
    }
    printf("extra_api->ioctl() rc=%d\n", rc);

    memset(recvBuff, 0, sizeof(recvBuff));
    if ((sockfd = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
        printf("\n Error : Could not create socket \n");
        return 1;
    }

    memset(&serv_addr, 0, sizeof(serv_addr));
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_port = htons(5000);

    if (inet_pton(AF_INET, argv[1], &serv_addr.sin_addr) <= 0) {
        printf("\n inet_pton error occurred\n");
        return 1;
    }

    if (connect(sockfd, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < 0) {
        printf("\n Error : Connect Failed \n");
        return 1;
    }

    while ((n = read(sockfd, recvBuff, sizeof(recvBuff) - 1)) > 0) {
        recvBuff[n] = 0;
        if (fputs(recvBuff, stdout) == EOF) {
            printf("\n Error : Fputs error\n");
        }
    }

    if (n < 0) {
        printf("\n Read error \n");
    }

    return 0;
}

© Copyright 2023, NVIDIA. Last updated on Feb 8, 2024.