XLIO Extra API
This chapter is intended for application developers who want to maximize XLIO performance. By using the Extra API, they can achieve the following:
To further lower latencies
To increase throughput
To gain additional CPU cycles for the application logic
To better control XLIO offload capabilities
All socket applications are limited to the given Socket API interface functions.
The XLIO Extra API exposes an additional set of functions that let the application developer use zero copy receive function calls and low-level packet filtering, inspecting the incoming packet headers or packet payload at a very early stage in the processing.
XLIO is designed as a dynamically linked user-space library. As such, the XLIO Extra API has been designed to allow the user to dynamically load XLIO and to detect at runtime if the additional functionality described here is available or not. The application is still able to run over the general socket library without XLIO loaded as it did previously, or can use an application flag to decide which API to use: Socket API or XLIO Extra API.
The XLIO Extra API is provided as a header file with the XLIO binary RPM. The application developer needs to include this header file in the application code.
After the XLIO RPM is installed on the target host, the XLIO Extra API header file is available at the following path:
#include "/usr/include/mellanox/xlio_extra.h"
The xlio_extra.h provides detailed information about the various functions and structures, and instructions on how to use them.
An example using the XLIO Extra API can be seen in the udp_lat source code. Follow the ‘--xliozcopyread’ flag for the zero copy recvfrom logic.
A specific example for using the TCP zero copy extra API can be seen under extra_api_tests/tcp_zcopy_cb.
During runtime, use the xlio_get_api() function to check if XLIO is loaded in your application and if XLIO Extra API is accessible.
If the function returns NULL, either XLIO is not loaded with the application, or the XLIO Extra API is not compatible with the header file used to compile your application. NULL is the typical return value when running the application on the native OS without XLIO loaded.
Any non-NULL return value is a pointer to an xlio_api_t structure that holds pointers to the specific XLIO Extra API functions the application needs. The available functions can be checked using the cap_mask bit mask field.
It is recommended to call xlio_get_api() once on startup, and to use the returned pointer throughout the life of the process.
There is no need to ‘release’ this pointer in any way.
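The capability check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not XLIO's own code: `struct api_view` is a hypothetical stand-in for the `cap_mask` field of `xlio_api_t`, and the `DEMO_CAP_*` bit values are made up. In a real application the pointer would come from xlio_get_api() and the capability bits from xlio_extra.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the cap_mask field of xlio_api_t. */
struct api_view {
    uint64_t cap_mask;
};

/* Made-up capability bits, for illustration only. */
#define DEMO_CAP_A (1ULL << 0)
#define DEMO_CAP_B (1ULL << 1)

/* Returns true only if the API pointer is usable and every required bit is set. */
static bool api_has_caps(const struct api_view *api, uint64_t required)
{
    if (api == NULL) {
        return false; /* XLIO not loaded: the caller falls back to the Socket API */
    }
    return (api->cap_mask & required) == required;
}
```

A caller would typically evaluate this once at startup and store the result in the application flag that selects between the Socket API and the XLIO Extra API.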
Adding libxlio.conf Rules During Run-Time
Adds a libxlio.conf rule to the top of the list. The rule does not apply to existing sockets that have already evaluated the configuration rules (around connect/listen/send/recv, and so on).
Syntax:
int (*add_conf_rule)(char *config_line);
Return value:
0 on success
error code on failure
Where:
config_line
Description – new rule to add to the top of the list (highest priority)
Value – a char buffer in the exact format defined in libxlio.conf, terminated with '\0'
Creating Sockets as Offloaded or Not-Offloaded
Creates sockets on pthread tid as offloaded/not-offloaded. This does not affect existing sockets. Offloaded sockets are still subject to libxlio.conf rules.
Usually combined with the XLIO_OFFLOADED_SOCKETS parameter.
Syntax:
int (*thread_offload)(int offload, pthread_t tid);
Return value:
0 on success
error code on failure
Where:
offload
Description – Offload property
Value – 1 for offloaded, 0 for not-offloaded
tid
Description – thread ID
Zero Copy recvfrom()
Zero-copy recvfrom implementation. This function attempts to receive a packet without doing data copy.
Syntax:
int (*recvfrom_zcopy)(int s, void *buf, size_t len, int *flags, struct sockaddr *from, socklen_t *fromlen);
Where:
Parameter Name | Description | Values
s | Socket file descriptor |
buf | Buffer to fill with received data or pointers to data (see below). |
flags | Pointer to flags (see below). | Usual flags to recvmsg(), and MSG_XLIO_ZCOPY_FORCE
from | If not NULL, is set to the source address (same as recvfrom). |
fromlen | If not NULL, is set to the source address size (same as recvfrom). |
The flags parameter can contain the usual flags to recvmsg(), and also the MSG_XLIO_ZCOPY_FORCE flag. If the latter is not set, the function reverts to data copy (i.e., zero-copy cannot be performed). If zero-copy is performed, the flag MSG_XLIO_ZCOPY is set upon exit.
If zero copy is performed (MSG_XLIO_ZCOPY flag is returned), the buffer is filled with an xlio_recvfrom_zcopy_packets_t structure holding as many fragments as `len' allows. The total size of all fragments is returned. Otherwise, the buffer is filled with actual data, and its size is returned (same as recvfrom()).
If the return value is positive, data copy has been performed. If the return value is zero, no data has been received.
Freeing Zero Copied Packet Buffers
Frees a packet received by "recvfrom_zcopy()" or held by "receive callback".
Syntax:
int (*recvfrom_zcopy_free_packets)(int s, struct xlio_recvfrom_zcopy_packet_t *pkts, size_t count);
Where:
s – socket from which the packet was received
pkts – array of packet identifiers
count – number of packets in the array
Return value:
0 on success, -1 on failure
errno is set to:
EINVAL – not an offloaded socket
ENOENT – the packet was not received from 's'.
Example:
entry | Source | Source-mask | Dest | Dest-mask | Interface | Service | Routing | Status | Log
1 | any | any | any | any | if0 | any | tunneling | active | 1
2 | 192.168.2.0 | 255.255.255.0 | any | any | if1 | any | tunneling | active | 1
Expected result:
sRB-20210G-61f0(statistic)# log show
counter | tx total pack | tx total byte | rx total pack | rx total byte
1 | 2733553 | 268066596 | 3698 | 362404
Parameter | Description
tx total byte | The number of transmit bytes associated with a TFM rule; has a log counter n. The above example shows the number of bytes sent from InfiniBand to Ethernet (one way) or sent between InfiniBand and Ethernet and matching the two TFM rules with log counter #1.
rx total pack | The number of receive packets associated with a TFM rule; has a log counter n.
rx total byte | The number of receive bytes associated with a TFM rule; has a log counter n.
The XLIO SocketXtreme API has been developed to optimize the data path of the socket API, while preserving the familiar standard socket API for control operations, such as select(), poll(), epoll_wait(), recv(), recvfrom(), recvmsg(), read(), and readv().
The XLIO SocketXtreme API enhances application performance in the following ways:
Reduced Context Switching: The lightweight library call to socketxtreme_poll(...) eliminates the need for traditional methods like `poll()`, `select()`, `epoll()`, and similar interfaces, as well as subsequent read(), recv(), recvmsg(), and other socket-related system calls. It achieves asynchronous I/O for both data reception (RX) and data transmission (TX) by concurrently polling multiple sockets on the same ring (more information about rings is provided in subsequent sections).
Comprehensive Data Handling: The socketxtreme_poll function provides the user application with detailed data in the form of the struct xlio_socketxtreme_completion_t. This structure, elaborated upon below, equips the application to effectively manage the status of sockets associated with the ring file descriptor and process the data received through these sockets.
Usage
Verify SocketXtreme support
To employ the XLIO extra API and the SocketXtreme interface, follow these steps:
1. Obtain the XLIO API using the `xlio_get_api` function.
2. Verify the `XLIO_MAGIC_NUMBER` to ensure compatibility.
3. Verify the XLIO capabilities, as demonstrated below.
#include <inttypes.h>
#include <stdio.h>
#include <mellanox/xlio_extra.h>

static struct xlio_api_t *get_xlio_api(void)
{
    /* The returned pointer is owned by XLIO and must not be freed. */
    struct xlio_api_t *api_ptr = xlio_get_api();
    if (api_ptr == NULL) {
        return NULL;
    }
    if (api_ptr->magic != XLIO_MAGIC_NUMBER) {
        printf("Unexpected XLIO API magic number: expected %" PRIx64 ", got %" PRIx64 "\n",
               (uint64_t)XLIO_MAGIC_NUMBER, api_ptr->magic);
        return NULL;
    }
    uint64_t required_caps = XLIO_EXTRA_API_GET_SOCKET_RINGS_FDS |
                             XLIO_EXTRA_API_SOCKETXTREME_POLL |
                             XLIO_EXTRA_API_SOCKETXTREME_FREE_PACKETS |
                             XLIO_EXTRA_API_IOCTL;
    if ((api_ptr->cap_mask & required_caps) != required_caps) {
        printf("Required XLIO caps are missing: required %" PRIx64 ", got %" PRIx64 "\n",
               required_caps, api_ptr->cap_mask);
        return NULL;
    }
    return api_ptr;
}
Replace the socket file descriptor with alternative identifier
XLIO introduces the capability to substitute a socket file descriptor with user-defined data. To utilize this feature, XLIO provides a new setsockopt option, as illustrated in the example below.
uint64_t user_data = (uintptr_t)user_ptr;
int rc = xlio_api->setsockopt(socket_fd, SOL_SOCKET, SO_XLIO_USER_DATA,
                              &user_data, sizeof(user_data));
if (rc != 0) {
    printf("Failed to set socket user data for sock %d: rc %d, errno %d\n",
           socket_fd, rc, errno);
    goto fail;
}
Get ring file descriptor(s) from socket file descriptor
The ring file descriptor(s) are required as arguments to socketxtreme_poll.
int ring_fds[2];
int num_rings;
num_rings = xlio_api->get_socket_rings_fds(socket_fd, ring_fds, 2);
if (num_rings < 0) {
    printf("Failed to get ring FDs for socket %d: rc %d, errno %d\n",
           socket_fd, num_rings, errno);
}
Parameter/Return value | Description
num_rings | The actual number of rings returned by the function. If the number is -1, the function failed; check errno.
ring_fds | An array of integers to be filled upon success
num_ring_fds | The maximal number of ring descriptors to fill
Poll ring fd
Each ring file descriptor is polled by passing it to socketxtreme_poll.
struct xlio_socketxtreme_completion_t comps[MAX_EVENTS_PER_POLL];
int num_events = xlio_api->socketxtreme_poll(ring_fd, comps,
                                             MAX_EVENTS_PER_POLL, SOCKETXTREME_POLL_TX);
Parameter | Description
num_events | The actual number of events returned by the function. If the number is -1, the function failed; check errno.
ring_fd | The ring file descriptor to poll
comps | An array of XLIO completion events
Processing the completions
For a detailed description of xlio_socketxtreme_completion_t, please review the xlio_extra.h header file.
When the `XLIO_SOCKETXTREME_PACKET` flag is enabled within the `xlio_completion_t.events` field, it indicates that the completion is associated with the descriptor of an incoming packet. You can access this descriptor through the `xlio_completion_t.packet` field. This descriptor points to XLIO buffers that hold data distributed by the hardware, ensuring that the data is delivered to the application without the need for copying. After the application has finished using the returned packets and their associated buffers, they must be released using the `socketxtreme_free_packets()` and `socketxtreme_free_buff()` functions. If the `XLIO_SOCKETXTREME_PACKET` flag is disabled, the `xlio_completion_t.packet` field remains reserved.
In addition to indicating the arrival of a packet, XLIO also reports the `XLIO_SOCKETXTREME_NEW_CONNECTION_ACCEPTED` event and standard epoll events through the `xlio_completion_t.events` field. The `XLIO_SOCKETXTREME_NEW_CONNECTION_ACCEPTED` event is triggered when a new connection is accepted by the server. When using `socketxtreme_poll()`, new connections are accepted automatically, and there's no need to explicitly call `accept()` on the listen socket. The `XLIO_SOCKETXTREME_NEW_CONNECTION_ACCEPTED` event pertains to the newly connected or child socket, with `xlio_completion_t.user_data` referring to the child socket. For events other than packet arrival and new connection acceptance, the `xlio_completion_t.events` bitmask is constructed using standard epoll API event types. It's important to note that the same completion can report multiple events; for instance, the `XLIO_SOCKETXTREME_PACKET` flag can be enabled alongside `EPOLLOUT` events and more.
for (int i = 0; i < num_events; i++) {
    struct xlio_socketxtreme_completion_t *comp = &comps[i];
    assert(comp->user_data != 0);
    /* Convert the user data to an application identifier and consume the data.
       Below we assume that the application converts the identifier to app_sink pointer.
    */
    app_sink_t *app_sink = (app_sink_t *)comp->user_data;
    if (comp->events & EPOLLERR) {
        app_sink->process_error_cb();
    }
    if (comp->events & XLIO_SOCKETXTREME_PACKET) {
        app_sink->process_packet(comp->packet);
    }
    if (comp->events & XLIO_SOCKETXTREME_NEW_CONNECTION_ACCEPTED) {
        app_sink->process_accepted_connection(comp->packet);
    }
}
Freeing packets and buffers
In the following example, we iterate through the buffers of a specific packet, free them, and then release the packet.
static inline void free_xlio_packet(struct xlio_socketxtreme_packet_desc_t *packet)
{
    assert(packet != NULL);
    /* Release every buffer in the packet's list, then the packet itself. */
    while (packet->num_bufs--) {
        assert(packet->buff_lst != NULL);
        struct xlio_buff_t *buff_to_free = packet->buff_lst;
        packet->buff_lst = packet->buff_lst->next;
        xlio_api->socketxtreme_free_buff(buff_to_free);
    }
    xlio_api->socketxtreme_free_packets(packet, 1);
}
The packet filter logic allows developers to inspect and dynamically decide whether to keep or drop incoming packets. The user's packet filtering callback follows this prototype:
typedef xlio_recv_callback_retval_t (*xlio_recv_callback_t)(int fd, size_t sz_iov, struct iovec iov[], struct xlio_info_t *xlio_info, void *context);
To register this callback function with XLIO, use the register_recv_callback() function provided by the XLIO Extra API. If you wish to unregister the callback, simply set the function pointer to NULL.
XLIO invokes this callback to inform the application about new incoming packets. This notification occurs after the internal processing of IP and UDP/TCP headers and precedes the queuing of these packets in the socket's receive queue.
The context of the callback is always that of one of the user's application threads that have previously called one of the following socket APIs: select(), poll(), epoll_wait(), recv(), recvfrom(), recvmsg(), read(), or readv().
fd | File descriptor of the socket to which this packet refers.
iov | iovector structure array pointer holding the packet received, data buffer pointers, and the size of each buffer.
sz_iov | Size of the iov array.
xlio_info | Additional information on the packet and socket.
context | User-defined value provided during callback registration for each socket.
The application is allowed to invoke all Socket APIs from within the callback context. However, it's important to note that depending on how the application behaves in this context, packet loss may occur. In cases where the callback behavior is extremely rapid and non-blocking, it's less likely to lead to packet loss.
Regarding the parameters "iov" and "xlio_info," it's crucial to understand that their validity is limited to the duration of the callback context. If you intend to work with zero-copy logic and require these structures for later use, it's advisable to make copies of them before exiting the callback context.
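Copying the "iov" array before leaving the callback can be sketched as below. This is only an illustration: `dup_iov_array` is a hypothetical helper name, and the sketch assumes the application keeps the underlying packet buffers alive through the zero-copy mechanism, so only the descriptor array is duplicated, not the payload.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Duplicates the iovec descriptor array so it can outlive the callback.
 * The payload is not copied: each entry still points at the original buffer.
 * Returns NULL for an empty array or on allocation failure; the caller frees the result.
 */
static struct iovec *dup_iov_array(const struct iovec *iov, size_t sz_iov)
{
    if (iov == NULL || sz_iov == 0) {
        return NULL;
    }
    struct iovec *copy = malloc(sz_iov * sizeof(*copy));
    if (copy == NULL) {
        return NULL;
    }
    memcpy(copy, iov, sz_iov * sizeof(*copy));
    return copy;
}
```

The same approach applies to the "xlio_info" structure: copy it by value into application-owned storage before the callback returns.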
Dumps statistics for the given fd using the log_level log level.
Syntax:
int (*dump_fd_stats)(int fd, int log_level);
Parameters:
Parameter | Description
fd | fd to dump, 0 for all open fds.
log_level | Dumping level, corresponding to the vlog_levels_t enum (vlogger.h): VLOG_NONE = -1, VLOG_PANIC = 0, VLOG_ERROR = 1, VLOG_WARNING = 2, VLOG_INFO = 3, VLOG_DETAILS = 4, VLOG_DEBUG = 5, VLOG_FUNC = VLOG_FINE = 6, VLOG_FUNC_ALL = VLOG_FINER = 7, VLOG_ALL = 8
For an output example, see section Monitoring – the xlio_stats Utility.
Return values: 0 on success, -1 on failure
The “Dummy Send” feature gives the application developer the capability to send dummy packets in order to warm up the CPU caches on the XLIO send path, hence minimizing cache misses and improving latency. The dummy packets reach the hardware NIC and are then dropped.
The application developer is responsible for sending the dummy packets by setting the XLIO_SND_FLAGS_DUMMY bit in the flags parameter of send(), sendto(), sendmsg(), and sendmmsg() sockets API.
Parameters:
Parameter | Description
XLIO_SND_FLAGS_DUMMY | Indicates a dummy packet
Return values:
Same as the original APIs for offloaded sockets. Otherwise, -1 is returned and errno is set to EINVAL.
Usage example:
void dummyWait(Timer waitDuration, Timer dummySendCycleDuration)
{
    Timer now = Timer::now();
    Timer endTime = now + waitDuration;
    Timer nextDummySendTime = now + dummySendCycleDuration;
    for ( ; now < endTime; now = Timer::now()) {
        if (now >= nextDummySendTime) {
            send(fd, buf, len, XLIO_SND_FLAGS_DUMMY);
            nextDummySendTime += dummySendCycleDuration;
        }
    }
}
This sample code sends a dummy packet every dummySendCycleDuration using the XLIO extra API, for as long as the total time does not exceed waitDuration.
It is recommended not to send more than 50k dummy packets per second.
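A cap of 50k packets per second corresponds to a minimum gap of 1 s / 50,000 = 20 microseconds between dummy sends. The helpers below are a small sketch of that arithmetic; the names `min_send_interval_ns` and `clamp_cycle_ns` are hypothetical and not part of XLIO.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL

/* Smallest gap, in nanoseconds, between sends that keeps the rate at or below max_per_sec. */
static uint64_t min_send_interval_ns(uint64_t max_per_sec)
{
    assert(max_per_sec > 0);
    /* Round up so the resulting rate never exceeds the cap. */
    return (DEMO_NSEC_PER_SEC + max_per_sec - 1) / max_per_sec;
}

/* Raises a requested cycle duration to the minimum gap if it is too short. */
static uint64_t clamp_cycle_ns(uint64_t requested_ns, uint64_t max_per_sec)
{
    uint64_t floor_ns = min_send_interval_ns(max_per_sec);
    return requested_ns < floor_ns ? floor_ns : requested_ns;
}
```

An application could pass its intended cycle duration through `clamp_cycle_ns(requested, 50000)` before entering a wait loop, so that a misconfigured interval cannot exceed the recommended rate.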
Verifying “Dummy Send” Capability in HW
In order to verify “Dummy Send” capability in the hardware, run XLIO with DEBUG trace level.
XLIO_TRACELEVEL=DEBUG LD_PRELOAD=<path to libxlio.so> <command line>
Look in the printout for “HW Dummy send support for QP = [0|1]”.
For example:
Pid: 3832 Tid: 3832 XLIO DEBUG: qpm[0x2097310]:121:configure() Creating QP of transport type 'ETH' on ibv device 'mlx5_0' [0x201e460] on port 1
Pid: 3832 Tid: 3832 XLIO DEBUG: qpm[0x2097310]:137:configure() HW Dummy send support for QP = 1
Pid: 3832 Tid: 3832 XLIO DEBUG: cqm[0x203a460]:269:cq_mgr() Created CQ as Tx with fd[25] and of size 3000 elements (ibv_cq_hndl=0x20a0000)
“Dummy Packets” Statistics
Run xlio_stats tool to view the total amount of dummy-packets sent.
xlio_stats -p <pid> -v 3
The number of dummy messages sent will appear under the relevant fd. For example:
======================================================
Fd=[20]
- UDP, Blocked, MC Loop Enabled
- Local Address = [0.0.0.0:56732]
Tx Offload: 128 / 9413 / 0 / 0 [kilobytes/packets/drops/errors]
Tx Dummy messages : 87798
Rx Offload: 128 / 9413 / 0 / 0 [kilobytes/packets/eagains/errors]
Rx byte: cur 0 / max 14 / dropped 0 / limit 212992
Rx pkt : cur 0 / max 1 / dropped 0
Rx poll: 0 / 9411 (100.00%) [miss/hit]
======================================================
This function allows the application to communicate with the library using an extendable protocol based on struct cmsghdr.
Syntax:
int (*ioctl)(void *cmsg_hdr, size_t cmsg_len);
Parameters:
Parameter | Description
cmsg_hdr | The address of the ancillary data.
cmsg_len | The length of the ancillary data passed in cmsg_hdr. Note that if multiple ancillary data sections are being passed, this length should reflect the total length of all ancillary data sections.
The cmsg_hdr parameter points to the ancillary data. This cmsg_hdr pointer points to the following structure (C/C++ example shown) that describes the ancillary data.
struct cmsghdr {
    size_t cmsg_len;   /* data byte count includes hdr */
    int cmsg_level;    /* originating protocol */
    int cmsg_type;     /* protocol-specific type */
    /* followed by u_char cmsg_data[]; */
};
Ancillary data is a sequence of cmsghdr structures with appended data. The sequence of cmsghdr structures should never be accessed directly. Instead, use only the following macros: CMSG_ALIGN, CMSG_SPACE, CMSG_DATA, CMSG_LEN.
Guidelines:
The cmsg_len should be set to the length of the cmsghdr plus the length of all ancillary data that follows immediately after the cmsghdr. This is represented by the commented out cmsg_data field.
The cmsg_level should be set to the option level (for example, SOL_SOCKET).
The cmsg_type should be set to the option name (for example, CMSG_XLIO_IOCTL_USER_ALLOC).
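The guidelines above, together with the note that cmsg_len must cover all sections when several are passed, can be illustrated with the standard CMSG macros. This is a sketch under stated assumptions: `put_cmsg` is a hypothetical helper, the payloads are arbitrary, and the type value is a placeholder rather than a real XLIO command.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Writes one ancillary-data section at dst and returns the space it occupies
 * (header + payload + alignment padding), which is where the next section starts.
 * dst must be suitably aligned for struct cmsghdr.
 */
static size_t put_cmsg(void *dst, int level, int type,
                       const void *payload, size_t payload_len)
{
    struct cmsghdr *cmsg = dst;
    cmsg->cmsg_len = CMSG_LEN(payload_len); /* header plus payload, no trailing padding */
    cmsg->cmsg_level = level;
    cmsg->cmsg_type = type;                 /* placeholder value in this sketch */
    memcpy(CMSG_DATA(cmsg), payload, payload_len);
    return CMSG_SPACE(payload_len);
}
```

Summing the values returned for each section gives the total length of the ancillary data, while each header's own cmsg_len stays at CMSG_LEN of its payload.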
Supported commands:
Command | Description
CMSG_XLIO_IOCTL_USER_ALLOC | Use user defined function to allocate global pools
CMSG_XLIO_IOCTL_USER_ALLOC
Field size | Description
uint8_t | control flags
uintptr_t | pointer to memory allocation function
uintptr_t | pointer to memory free function
Control Flags
enum {
    IOCTL_USER_ALLOC_TX = (1 << 0),
    IOCTL_USER_ALLOC_RX = (1 << 1),
    IOCTL_USER_ALLOC_TX_ZC = (1 << 2)
};
Usage Example
In this example, the application uses CMSG_XLIO_IOCTL_USER_ALLOC command.
#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <netdb.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <arpa/inet.h>
#include <mellanox/xlio_extra.h>

/* Allocates a page-aligned block, rounding the size up to a whole number of pages. */
void *my_alloc(size_t sz_bytes)
{
    void *m_data_block = NULL;
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size > 0) {
        sz_bytes = (sz_bytes + page_size - 1) & ~((size_t)page_size - 1);
        /* posix_memalign() returns 0 on success and an error number on failure. */
        int ret = posix_memalign(&m_data_block, page_size, sz_bytes);
        if (ret != 0) {
            return NULL;
        }
    }
    return m_data_block;
}

void my_free(void *ptr)
{
    free(ptr);
}

int main(int argc, char *argv[])
{
    int sockfd = 0, n = 0;
    char recvBuff[1024];
    struct sockaddr_in serv_addr;

    if (argc != 2) {
        printf("\n Usage: %s <ip of server> \n", argv[0]);
        return 1;
    }

    /* Payload layout of the CMSG_XLIO_IOCTL_USER_ALLOC command. */
#pragma pack(push, 1)
    struct {
        uint8_t flags;
        void *(*alloc_func)(size_t);
        void (*free_func)(void *);
    } data;
#pragma pack(pop)

    struct cmsghdr *cmsg;
    char cbuf[CMSG_SPACE(sizeof(data))];

    errno = 0;
    cmsg = (struct cmsghdr *)cbuf;
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = CMSG_XLIO_IOCTL_USER_ALLOC;
    cmsg->cmsg_len = CMSG_LEN(sizeof(data));
    data.flags = 0x03; /* IOCTL_USER_ALLOC_TX | IOCTL_USER_ALLOC_RX */
    data.alloc_func = my_alloc;
    data.free_func = my_free;
    memcpy(CMSG_DATA(cmsg), &data, sizeof(data));

    struct xlio_api_t *extra_api;
    extra_api = xlio_get_api();
    printf("extra_api=%p\n", (void *)extra_api);
    int rc = 0;
    if (extra_api) {
        rc = extra_api->ioctl(cmsg, cmsg->cmsg_len);
    }
    printf("extra_api->ioctl() rc=%d\n", rc);

    memset(recvBuff, 0, sizeof(recvBuff));
    if ((sockfd = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
        printf("\n Error : Could not create socket \n");
        return 1;
    }

    memset(&serv_addr, 0, sizeof(serv_addr));
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_port = htons(5000);
    if (inet_pton(AF_INET, argv[1], &serv_addr.sin_addr) <= 0) {
        printf("\n inet_pton error occurred\n");
        return 1;
    }

    if (connect(sockfd, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < 0) {
        printf("\n Error : Connect Failed \n");
        return 1;
    }

    /* Read until the peer closes the connection, echoing the data to stdout. */
    while ((n = read(sockfd, recvBuff, sizeof(recvBuff) - 1)) > 0) {
        recvBuff[n] = 0;
        if (fputs(recvBuff, stdout) == EOF) {
            printf("\n Error : Fputs error\n");
        }
    }
    if (n < 0) {
        printf("\n Read error \n");
    }
    return 0;
}
General Information
The XLIO Socket API is an event-based API for high-performance scenarios. It is a non-standard API, and the application must be integrated with it explicitly.
The XLIO Socket API triggers a callback immediately when the respective event happens. This reduces latency and simplifies event handling. The API also makes it possible to avoid event aggregation when it turns out to be unnecessary.
There are two ways to call the API:
Direct function calls: The prototypes are declared in <mellanox/xlio.h>. This approach requires explicit linkage with the XLIO static library.
Indirect function calls: The calls are made through the pointers provided by xlio_get_api(). The prototypes are declared in <mellanox/xlio_extra.h>.
Common types are defined in <mellanox/xlio_types.h>, which is included implicitly by the above headers.
Current limitations:
Only TCP sockets are supported.
Only polling mode is supported.
No listen sockets support.
For a sample application, please refer to tests/extra_api/xlio_socket_api.c within the XLIO sources.
Global Initialization
XLIO Socket API requires explicit global initialization before using any other functions. The initialization is a heavy process and is expected to be performed in advance.
Types definitions
struct xlio_init_attr {
    unsigned flags;
    xlio_memory_cb_t memory_cb;
    /* Optional external user allocator for XLIO buffers. */
    void *(*memory_alloc)(size_t);
    void (*memory_free)(void *);
};
Where
Field | Description
flags | Global flags. Currently unused
memory_cb | An optional callback called when XLIO allocates memory for data buffers. Zerocopy RX buffers point to such memory only. The user can use this information to prepare the allocated memory for further processing of the zerocopy RX data
memory_alloc, memory_free | An optional external allocator to be used for the data buffers. The external allocator and memory_cb are orthogonal and may be used together
Syntax
Global initialization
int xlio_init_ex(const struct xlio_init_attr *attr);
Where
Argument | Description
attr | Global attributes
Return value
Returns 0 on success. On error, -1 is returned, and errno is set to indicate the error.
The user should finalize the XLIO library with xlio_exit() when it is no longer needed. Usually, this is done during the termination phase. Neither the XLIO Socket API nor the intercepted POSIX API may be used after the finalization.
Syntax
Global finalization
xlio_exit();
XLIO Polling Groups
An XLIO polling group is a collection of XLIO sockets and their internal auxiliary objects. An XLIO polling group is represented by the opaque xlio_poll_group_t type.
Polling groups do not share objects. Thus, object migration between groups is not supported.
Operations with different polling groups do not overlap, except for unlikely protected access to global pools. Therefore, multiple polling groups can work in parallel without serialization.
Operations with the same polling group must be serialized.
Polling groups are not bound to CPU/thread. It is allowed to use a single polling group on multiple CPUs if serialization is guaranteed. For example, this approach can be used for a polling group migration implementation.
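One way to guarantee serialization when a single polling group is driven from more than one thread is a per-group lock. The sketch below is illustrative only: `struct guarded_group`, `guarded_poll`, and `demo_poll` are hypothetical names, and `poll_fn` stands in for an application wrapper around xlio_poll_group_poll().

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical wrapper: one lock per polling group serializes all operations on it. */
struct guarded_group {
    pthread_mutex_t lock;
    uintptr_t group;                   /* stand-in for the polling group handle */
    void (*poll_fn)(uintptr_t group);  /* stand-in for a call into the polling function */
    unsigned long polls;               /* completed polls, kept for illustration */
};

/* Placeholder poll routine used when no real polling group is available. */
static void demo_poll(uintptr_t group)
{
    (void)group;
}

/* Polls the group while holding its lock, so callers on different threads never overlap. */
static void guarded_poll(struct guarded_group *g)
{
    int rc = pthread_mutex_lock(&g->lock);
    assert(rc == 0);
    g->poll_fn(g->group);
    g->polls++;
    rc = pthread_mutex_unlock(&g->lock);
    assert(rc == 0);
    (void)rc;
}
```

Since a lock adds cost on the data path, giving each thread its own polling group avoids the need for it in the common case.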
Recommendations:
Polling groups are expected to be long-lived objects.
An application is expected to use one polling group per CPU/thread and possibly a small number of extra groups.
Each polling group creates HW objects per utilized network interface. Minimizing the number of utilized network interfaces per group will improve HW resources utilization.
A major part of the XLIO activities is done in the context of xlio_poll_group_poll() call. Therefore, this function should be called frequently enough to reduce latency and avoid runtime issues such as timeouts and TCP retransmissions.
Flags definitions
#define XLIO_GROUP_FLAG_SAFE 0x1
#define XLIO_GROUP_FLAG_DIRTY 0x2
Where
Flag | Description
XLIO_GROUP_FLAG_SAFE | Relaxes thread-safety requirements: allows a send operation to be called concurrently with the polling group operations. However, all the group operations and socket creation/destruction still must be serialized. Concurrent send operations still must be serialized. This flag has a runtime cost and is expected to be used for performance non-critical sockets
XLIO_GROUP_FLAG_DIRTY | Requests the group to track dirty sockets. Required for xlio_poll_group_flush() to function
Types definitions
struct xlio_poll_group_attr {
    unsigned flags;
    void (*socket_event_cb)(xlio_socket_t, uintptr_t userdata_sq, int event, int value);
    void (*socket_comp_cb)(xlio_socket_t, uintptr_t userdata_sq, uintptr_t userdata_op);
    void (*socket_rx_cb)(xlio_socket_t, uintptr_t userdata_sq, void *data, size_t len,
                         struct xlio_buf *buf);
};
Where
Field | Description
flags | Polling group flags
socket_event_cb | Mandatory callback for socket events
socket_comp_cb | Completion callback for zerocopy send operations
socket_rx_cb | Callback for RX data delivery
Syntax
Creating XLIO polling group
int xlio_poll_group_create(const struct xlio_poll_group_attr *attr, xlio_poll_group_t *group_out);
Where
Argument | Description
attr | Polling group attributes
group_out | On success, the created polling group is saved there
Return value
Returns 0 on success. On error, -1 is returned, and errno is set to indicate the error.
Syntax
XLIO polling group destruction
int xlio_poll_group_destroy(xlio_poll_group_t group);
Where
Argument | Description
group | XLIO polling group
Return value
Returns 0 on success. On error, -1 is returned, and errno is set to indicate the error.
Syntax
Polling
void xlio_poll_group_poll(xlio_poll_group_t group);
Where
Argument | Description
group | XLIO polling group
XLIO Sockets
An XLIO socket is similar to a POSIX socket, except that it has a separate, non-overlapping API. An XLIO socket is represented by the opaque xlio_socket_t type.
XLIO sockets have the following properties:
Always non-blocking.
No partial write support. Either all the data is accepted, or the call fails.
Types definitions
struct xlio_socket_attr {
    unsigned flags;
    int domain; /* AF_INET or AF_INET6 */
    xlio_poll_group_t group;
    uintptr_t userdata_sq;
};
Where
Field | Description
flags | Socket flags, currently unused
domain | Address family: either AF_INET or AF_INET6
group | XLIO polling group
userdata_sq | Opaque per-socket userdata
Syntax
XLIO socket creation
int xlio_socket_create(const struct xlio_socket_attr *attr, xlio_socket_t *sock_out);
Where
Argument | Description
attr | Socket attributes
sock_out | On success, the created socket object is saved there
Return value
Returns 0 on success. On error, -1 is returned, and errno is set to indicate the error.
Syntax
XLIO socket destruction
int xlio_socket_destroy(xlio_socket_t sock);
Where
Argument | Description
sock | XLIO socket object
Return value
Returns 0 on success. On error, -1 is returned, and errno is set to indicate the error.
Syntax
Connect XLIO socket
int xlio_socket_connect(xlio_socket_t sock, const struct sockaddr *to, socklen_t tolen);
Where
Argument | Description
sock | XLIO socket object
to | Remote address to connect to
tolen | Length of the address object
Return value
Returns 0 on success. On error, -1 is returned, and errno is set to indicate the error. An asynchronous connect counts as success; therefore, EINPROGRESS and EAGAIN errors are not possible. The result of an asynchronous connect is delivered with the socket event callback. Subsequent xlio_socket_connect() calls are ignored and their return code is undefined.
xlio_socket_setsockopt() and xlio_socket_bind() duplicate setsockopt(2) and bind(2) functionality respectively.
Syntax
setsockopt and bind
int xlio_socket_setsockopt(xlio_socket_t sock, int level, int optname, const void *optval, socklen_t optlen);
int xlio_socket_bind(xlio_socket_t sock, const struct sockaddr *addr, socklen_t addrlen);
XLIO exposes the protection domain as an ibv_pd object. The protection domain is related to the outgoing device used by the socket. There is expected to be one protection domain per outgoing interface and, as a result, sockets can share the same object depending on the remote IP address configuration.
xlio_socket_get_pd() should be called after XLIO determines the outgoing device for the socket, which happens in the context of xlio_socket_connect().
The main purpose of the exposed protection domain is to perform memory registration for the user’s TX data buffers which will be used in the TX zerocopy path. See ibv_reg_mr(3) and “TX Data Path” for details. See the XLIO Socket sample application for an example.
Syntax
Protection domain
struct ibv_pd *xlio_socket_get_pd(xlio_socket_t sock);
Where
Argument | Description
sock | XLIO socket object
Return value
Returns the protection domain for the socket on success. On error, NULL is returned. The function fails until xlio_socket_connect() is called for the respective socket.
XLIO Socket Events
The socket event callback is configured per polling group with xlio_poll_group_attr::socket_event_cb().
Most of the socket events are delivered from the xlio_poll_group_poll() context, except for XLIO_SOCKET_EVENT_TERMINATED, which can be triggered from the xlio_socket_destroy() context.
The socket event callback imposes the following restrictions on socket operations:
Send operations are allowed only while processing the XLIO_SOCKET_EVENT_ESTABLISHED event.
xlio_socket_destroy() is not allowed.
Syntax
Socket event callback
enum {
    /* TCP connection established. */
    XLIO_SOCKET_EVENT_ESTABLISHED = 1,
    /* Socket terminated and no further events are possible. */
    XLIO_SOCKET_EVENT_TERMINATED,
    /* Passive close. */
    XLIO_SOCKET_EVENT_CLOSED,
    /* An error occurred, see the error code value. */
    XLIO_SOCKET_EVENT_ERROR,
};
typedef void (*xlio_socket_event_cb_t)(xlio_socket_t sock, uintptr_t userdata_sq, int event, int value);
Where
Argument | Description
sock | XLIO socket object
userdata_sq | Opaque user data which is defined during socket creation
event | Represents the event
value | Holds a POSIX error code for the XLIO_SOCKET_EVENT_ERROR event. Should be ignored for other events
Possible error codes for the XLIO_SOCKET_EVENT_ERROR event:
ECONNABORTED - connection aborted by the local side
ECONNRESET - connection reset by the remote side
ECONNREFUSED - connection refused by the remote side during TCP handshake
ETIMEDOUT - connection timed out due to keepalive, user timeout option or TCP handshake timeout
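The event handling described above can be sketched as a callback that only records state and leaves teardown to code outside the callback, since xlio_socket_destroy() is not allowed there. This is a minimal sketch, not the library's own sample: the xlio_socket_t typedef and the enum are local stand-ins that mirror the declarations in this section (real code would include xlio_extra.h), and struct app_conn and app_socket_event_cb() are hypothetical application names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins mirroring this section; real code includes xlio_extra.h instead. */
typedef uintptr_t xlio_socket_t; /* assumption: treated as an opaque handle */

enum {
    XLIO_SOCKET_EVENT_ESTABLISHED = 1,
    XLIO_SOCKET_EVENT_TERMINATED,
    XLIO_SOCKET_EVENT_CLOSED,
    XLIO_SOCKET_EVENT_ERROR,
};

/* Hypothetical per-socket application state, passed as userdata_sq. */
struct app_conn {
    int established; /* set while the connection can be used for sends */
    int peer_closed; /* passive close was reported */
    int terminated;  /* no further events will arrive */
    int last_error;  /* POSIX code from an ERROR event, otherwise 0 */
};

/* Matches the xlio_socket_event_cb_t signature shown above. */
void app_socket_event_cb(xlio_socket_t sock, uintptr_t userdata_sq, int event, int value)
{
    struct app_conn *conn = (struct app_conn *)userdata_sq;

    (void)sock;
    assert(conn != NULL);

    switch (event) {
    case XLIO_SOCKET_EVENT_ESTABLISHED:
        conn->established = 1;
        break;
    case XLIO_SOCKET_EVENT_CLOSED:
        conn->peer_closed = 1;
        break;
    case XLIO_SOCKET_EVENT_ERROR:
        /* value is meaningful only for this event, e.g. ECONNRESET. */
        conn->last_error = value;
        conn->established = 0;
        break;
    case XLIO_SOCKET_EVENT_TERMINATED:
        conn->terminated = 1;
        conn->established = 0;
        break;
    default:
        break;
    }
}
```

The application would pass the address of its struct app_conn as the per-socket user data at socket creation and inspect the recorded flags after each polling iteration.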
TX Data Path
The TX path aggregates data until the user requests a flush. This removes the need for data aggregation at the user level and gives explicit control over sending larger, more efficient packets. There are three ways to flush sockets:
Polling group level flush with xlio_poll_group_flush()
Socket level flush with xlio_socket_flush()
Socket level flush with XLIO_SEND_FLAG_FLUSH flag in a send operation
For polling groups created with the XLIO_GROUP_FLAG_DIRTY flag, it is recommended to use only the group level flush. For sockets from a group without the flag, use the socket level flush.
The Nagle algorithm remains effective for XLIO sockets. However, it is recommended to use the explicit flush mechanism and disable the Nagle algorithm with either the TCP_NODELAY option or the XLIO_TCP_NODELAY parameter.
By default, send operations are zerocopy. The memory with data must be registered in advance in the XLIO protection domain. See xlio_socket_get_pd() and ibv_reg_mr(3).
The XLIO_SEND_FLAG_INLINE flag forces XLIO to copy data to its internal buffers. An inline send operation does not take ownership of the data memory, and the respective buffers may be reused immediately after the call returns. Such an operation ignores the xlio_send_attr::mkey and xlio_send_attr::userdata_op fields.
There is no partial write, and the TCP send buffer option does not affect the XLIO sockets. XLIO either queues all the data or returns an error. Errors are not recoverable.
Flags definitions
#define XLIO_SEND_FLAG_FLUSH 0x1
#define XLIO_SEND_FLAG_INLINE 0x2
Where
Flag | Description
XLIO_SEND_FLAG_FLUSH | Flush all aggregated data as part of the send operation
XLIO_SEND_FLAG_INLINE | Force XLIO to copy the data to its internal buffers
Types definitions
struct xlio_send_attr {
    unsigned flags;
    uint32_t mkey;
    uintptr_t userdata_op;
};
Where
Field | Description
flags | Bitwise OR of XLIO_SEND_FLAG_* values (see the flags definitions above)
mkey | Memory registration key (e.g. obtained via ibv_reg_mr(3))
userdata_op | Opaque per-operation userdata
Syntax
Send operation
int xlio_socket_send(xlio_socket_t sock, void *data, size_t len, struct xlio_send_attr *attr);
int xlio_socket_sendv(xlio_socket_t sock, struct iovec *iov, unsigned iovcnt, struct xlio_send_attr *attr);
Where
Argument | Description
sock | XLIO socket object
data | User pointer to the data to send
len | Length of the data
attr | Send operation attributes
iov | Vectorized data to send
iovcnt | Number of scatter-gather elements in the iov vector
Return value
Returns 0 on success. On error, -1 is returned, and errno is set to indicate the error.
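The difference between a zerocopy and an inline send lies mostly in how the attributes are filled in, which can be sketched as two small helpers. This is an illustrative sketch only: the flag values and struct xlio_send_attr are copied locally from this section so the block compiles on its own (real code would take them from xlio_extra.h), and send_attr_zerocopy() and send_attr_inline() are hypothetical helper names, not part of the XLIO API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copies of the definitions in this section; real code includes xlio_extra.h. */
#define XLIO_SEND_FLAG_FLUSH 0x1
#define XLIO_SEND_FLAG_INLINE 0x2

struct xlio_send_attr {
    unsigned flags;
    uint32_t mkey;
    uintptr_t userdata_op;
};

/*
 * Hypothetical helper: attributes for a zerocopy send.
 * mkey is assumed to come from a memory registration done in the socket's
 * protection domain. A non-zero userdata_op requests a completion; zero
 * disables it.
 */
void send_attr_zerocopy(struct xlio_send_attr *attr, uint32_t mkey,
                        uintptr_t userdata_op, int flush)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->flags = flush ? XLIO_SEND_FLAG_FLUSH : 0;
    attr->mkey = mkey;
    attr->userdata_op = userdata_op;
}

/*
 * Hypothetical helper: attributes for an inline (copying) send.
 * mkey and userdata_op are ignored for inline sends, so they stay zero.
 */
void send_attr_inline(struct xlio_send_attr *attr, int flush)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->flags = XLIO_SEND_FLAG_INLINE;
    if (flush) {
        attr->flags |= XLIO_SEND_FLAG_FLUSH;
    }
}
```

The filled structure would then be passed as the attr argument of xlio_socket_send() or xlio_socket_sendv(). Setting the flush flag only on the last send of a batch lets the earlier sends aggregate.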
Syntax
Flush operation
void xlio_socket_flush(xlio_socket_t sock);
void xlio_poll_group_flush(xlio_poll_group_t group);
Where
Argument | Description
sock | XLIO socket object.
group | XLIO polling group object.
TX Data Path – Zerocopy Completions
The user can request a completion for an individual zerocopy send operation by setting xlio_send_attr::userdata_op to a non-zero value. A zero value in xlio_send_attr::userdata_op disables the completion for that operation. With the completion, XLIO guarantees the following:
The respective data is delivered to the remote side
The data is acknowledged by the TCP protocol
The memory buffer is not used by XLIO
A completion is generated per operation rather than per buffer. On a completion, the user may reuse the respective memory buffers.
XLIO does not guarantee the order of completions. However, completions are likely to be generated in the same order as their respective send operations.
The user may provide the same xlio_send_attr::userdata_op value in multiple send operations. In that case, XLIO generates one completion per operation, each carrying the same userdata_op argument.
XLIO_SEND_FLAG_INLINE send operations do not generate completions.
Syntax
Zerocopy completion callback
void (*socket_comp_cb)(xlio_socket_t sock, uintptr_t userdata_sq, uintptr_t userdata_op);
Where
Argument | Description
sock | XLIO socket object.
userdata_sq | Opaque per-socket userdata.
userdata_op | Opaque per-operation userdata.
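Because completions are generated per operation and their order is not guaranteed, one way to track buffer ownership is to pass the address of a per-operation descriptor as userdata_op. The sketch below illustrates that idea under stated assumptions: xlio_socket_t is a local stand-in typedef, and struct tx_op, struct tx_stats, tx_op_begin() and app_socket_comp_cb() are hypothetical application names, not part of the XLIO API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in; real code includes xlio_extra.h instead. */
typedef uintptr_t xlio_socket_t; /* assumption: treated as an opaque handle */

/* Hypothetical descriptor for one zerocopy send operation. */
struct tx_op {
    void *buf;     /* registered memory handed to the send call */
    size_t len;
    int in_flight; /* 1 while XLIO may still use buf */
};

/* Hypothetical per-socket counters, passed as userdata_sq. */
struct tx_stats {
    unsigned completed;
};

/*
 * Marks the operation as in flight and returns the value to store in
 * xlio_send_attr::userdata_op. The descriptor address is never zero, so a
 * completion is always requested.
 */
uintptr_t tx_op_begin(struct tx_op *op, void *buf, size_t len)
{
    assert(op != NULL);
    op->buf = buf;
    op->len = len;
    op->in_flight = 1;
    return (uintptr_t)op;
}

/* Matches the socket_comp_cb signature shown above. */
void app_socket_comp_cb(xlio_socket_t sock, uintptr_t userdata_sq, uintptr_t userdata_op)
{
    struct tx_stats *stats = (struct tx_stats *)userdata_sq;
    struct tx_op *op = (struct tx_op *)userdata_op;

    (void)sock;
    assert(op != NULL && op->in_flight);

    /* XLIO no longer uses the memory, so the buffer may be reused. */
    op->in_flight = 0;
    if (stats != NULL) {
        stats->completed++;
    }
}
```

Looking up the descriptor by pointer rather than popping from a FIFO keeps the bookkeeping correct even when completions arrive out of order.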
RX Data Path
RX payload is delivered with the RX callback and treated as an RX event. There is no data aggregation on the socket layer, and data is delivered immediately. However, the orthogonal LRO and GRO features can perform aggregation on the lower layers, which can affect latency and data granularity.
The RX path is always zerocopy: XLIO provides a pointer to its internal buffer, which is in the user address space. Once the RX buffer is handled, the user is responsible for returning the buffer to XLIO.
The user can use an external allocator and/or notification about RX buffers memory allocation to control the memory area, which is used in the RX path. See “Global initialization” section above for details. If needed, the memory area may be prepared in advance for further handling by the application (e.g. register memory for RDMA operations).
XLIO provides an xlio_buf metadata object which defines xlio_buf::userdata. The field consists of 8 uninitialized bytes that the user may use while owning the buffer. The user owns a buffer from the respective RX callback until the buffer is returned to XLIO.
Syntax
RX data callback
void (*socket_rx_cb)(xlio_socket_t sock, uintptr_t userdata_sq, void *data, size_t len, struct xlio_buf *buf);
Where
Argument | Description
sock | XLIO socket object
userdata_sq | Opaque per-socket userdata
data | Pointer to the payload which points to an XLIO internal buffer
len | Data length
buf | A buffer metadata object which must be returned back to XLIO
Syntax
Return RX buffer
void xlio_socket_buf_free(xlio_socket_t sock, struct xlio_buf *buf);
void xlio_poll_group_buf_free(xlio_poll_group_t group, struct xlio_buf *buf);
Where
Argument | Description
sock | XLIO socket object
group | XLIO polling group object
buf | The metadata object to be returned back to XLIO
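The RX ownership rule (the user owns a buffer from the RX callback until it is returned) can be sketched as a callback that queues buffers and a later drain step that processes and returns them. Everything here is a sketch under assumptions: xlio_socket_t and struct xlio_buf are local stand-ins (only the 8-byte userdata field is described in this section), xlio_socket_buf_free() is replaced by a counting stub so the block compiles without XLIO, and struct rx_ctx, app_socket_rx_cb() and app_rx_drain() are hypothetical application names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins; real code includes xlio_extra.h instead. */
typedef uintptr_t xlio_socket_t; /* assumption: treated as an opaque handle */
struct xlio_buf {
    uint64_t userdata; /* 8 bytes usable by the owner of the buffer */
};

/* Stub replacing the real xlio_socket_buf_free(); it only counts calls. */
static unsigned g_returned;
void xlio_socket_buf_free(xlio_socket_t sock, struct xlio_buf *buf)
{
    (void)sock;
    (void)buf;
    g_returned++;
}

#define RX_PENDING_MAX 16

struct rx_entry {
    const void *data;
    size_t len;
    struct xlio_buf *buf;
};

/* Hypothetical per-socket RX state, passed as userdata_sq. */
struct rx_ctx {
    struct rx_entry pending[RX_PENDING_MAX];
    size_t count;
    uint64_t next_seq;
    size_t bytes_handled;
};

/* Matches the socket_rx_cb signature shown above. */
void app_socket_rx_cb(xlio_socket_t sock, uintptr_t userdata_sq, void *data,
                      size_t len, struct xlio_buf *buf)
{
    struct rx_ctx *ctx = (struct rx_ctx *)userdata_sq;

    (void)sock;
    assert(ctx != NULL && buf != NULL);
    /* A real application must size the queue or apply back-pressure. */
    assert(ctx->count < RX_PENDING_MAX);

    buf->userdata = ctx->next_seq++; /* tag the buffer while we own it */
    ctx->pending[ctx->count++] = (struct rx_entry){data, len, buf};
}

/* Processes queued payloads, then hands every buffer back to XLIO. */
void app_rx_drain(xlio_socket_t sock, struct rx_ctx *ctx)
{
    assert(ctx != NULL);
    for (size_t i = 0; i < ctx->count; i++) {
        ctx->bytes_handled += ctx->pending[i].len; /* placeholder for real work */
        xlio_socket_buf_free(sock, ctx->pending[i].buf);
    }
    ctx->count = 0;
}
```

The payload pointer stays valid only while the buffer is owned, so the sketch reads the data before calling the free function and never afterwards.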