XLIO Extra API
This chapter is intended for application developers who want to maximize XLIO performance. With the Extra API, they can achieve the following:
To further lower latencies
To increase throughput
To gain additional CPU cycles for the application logic
To better control XLIO offload capabilities
All socket applications are limited to the given Socket API interface functions. The XLIO Extra API enables XLIO to open a new set of functions which allow the application developer to add code which utilizes zero copy receive function calls and low-level packet filtering by inspecting the incoming packet headers or packet payload at a very early stage in the processing.
XLIO is designed as a dynamically linked user-space library. As such, the XLIO Extra API has been designed to allow the user to dynamically load XLIO and to detect at runtime if the additional functionality described here is available or not. The application is still able to run over the general socket library without XLIO loaded as it did previously, or can use an application flag to decide which API to use: Socket API or XLIO Extra API.
The XLIO Extra APIs are provided as a header with the XLIO binary rpm. The application developer needs to include this header file in the application code.
After installing the XLIO rpm on the target host, the XLIO Extra APIs header file is located at the following path:
#include "/usr/include/mellanox/xlio_extra.h"
The xlio_extra.h header provides detailed information about the various functions and structures, and instructions on how to use them.
An example using the XLIO Extra API can be seen in the udp_lat source code. Follow the ‘--xliozcopyread’ flag for the zero copy recvfrom logic.
A specific example for using the TCP zero copy extra API can be seen under extra_api_tests/tcp_zcopy_cb.
During runtime, use the xlio_get_api() function to check whether XLIO is loaded in your application and whether the XLIO Extra API is accessible.
On success, the function returns a valid API pointer; on failure, it returns NULL. A NULL return value means that either XLIO is not loaded with the application, or the XLIO Extra API is not compatible with the header file used for compiling your application. NULL is the typical return value when running the application on the native OS without XLIO loaded.
Any non-NULL return value is an xlio_api_t type structure pointer that holds pointers to the specific XLIO Extra API function calls which the application needs. Available functions can be checked using the cap_mask bit mask field.
It is recommended to call xlio_get_api() once on startup, and to use the returned pointer throughout the life of the process.
There is no need to ‘release’ this pointer in any way.
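The probe-once pattern described above can be sketched as follows. This is only an illustration under stated assumptions: struct demo_api, demo_get_api() and DEMO_CAP_ZCOPY are stand-ins invented for this sketch so that it compiles without xlio_extra.h; they are not the real xlio_api_t layout, the real xlio_get_api(), or a real capability bit.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for xlio_api_t: only the field this sketch needs. NOT the real layout. */
struct demo_api {
    uint64_t cap_mask;
};

/* Hypothetical capability bit; the real bit names are defined in xlio_extra.h. */
#define DEMO_CAP_ZCOPY (UINT64_C(1) << 0)

/* Stand-in for xlio_get_api(): returns NULL, as happens when XLIO is not loaded. */
static int demo_probe_calls;
static struct demo_api *demo_get_api(void)
{
    demo_probe_calls++;
    return NULL;
}

/* Probe once and reuse the result for the life of the process (not thread-safe). */
static struct demo_api *extra_api(void)
{
    static struct demo_api *cached;
    static int probed;

    if (!probed) {
        cached = demo_get_api();
        probed = 1;
    }
    return cached;
}

/* True only when the API is present and advertises every bit in caps. */
static int extra_api_has(const struct demo_api *api, uint64_t caps)
{
    return api != NULL && (api->cap_mask & caps) == caps;
}
```

An application would typically branch on extra_api_has() once per feature and fall back to the regular Socket API when it returns 0.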
Adding libxlio.conf Rules During Run-Time
Adds a libxlio.conf rule to the top of the list. The rule does not apply to existing sockets that have already evaluated the configuration rules (this happens around connect/listen/send/recv).
Syntax:
int (*add_conf_rule)(char *config_line);
Return value:
0 on success
error code on failure
Where:
config_line
Description – new rule to add to the top of the list (highest priority)
Value – a char buffer with the exact format as defined in libxlio.conf, and should end with '\0'
Creating Sockets as Offloaded or Not-Offloaded
Creates sockets on the pthread identified by tid as offloaded or not-offloaded. This does not affect existing sockets. Offloaded sockets are still subject to libxlio.conf rules.
Usually combined with the XLIO_OFFLOADED_SOCKETS parameter.
Syntax:
int (*thread_offload)(int offload, pthread_t tid);
Return value:
0 on success
error code on failure
Where:
offload
Description – Offload property
Value – 1 for offloaded, 0 for not-offloaded
tid
Description – thread ID
Zero Copy recvfrom()
Zero-copy recvfrom implementation. This function attempts to receive a packet without doing data copy.
Syntax:
int (*recvfrom_zcopy)(int s, void *buf, size_t len, int *flags, struct sockaddr *from, socklen_t *fromlen);
Where:
Parameter Name | Description | Values |
s | Socket file descriptor | |
buf | Buffer to fill with received data or pointers to data (see below). | |
flags | Pointer to flags (see below). | Usual flags to recvmsg(), and MSG_XLIO_ |
from | If not NULL, is set to the source address (same as recvfrom) | |
fromlen | If not NULL, is set to the source address size (same as recvfrom). |
The flags parameter can contain the usual flags to recvmsg(), and also the MSG_XLIO_ZCOPY_FORCE flag. If the latter is not set, the function reverts to data copy when zero-copy cannot be performed. If zero-copy is performed, the flag MSG_XLIO_ZCOPY is set upon exit.
If zero-copy is performed (the MSG_XLIO_ZCOPY flag is returned), the buffer is filled with an xlio_recvfrom_zcopy_packets_t structure holding as many fragments as len allows. The total size of all fragments is returned. Otherwise, the buffer is filled with actual data, and its size is returned (same as recvfrom()).
If the return value is positive, data copy has been performed. If the return value is zero, no data has been received.
Freeing Zero Copied Packet Buffers
Frees a packet received by recvfrom_zcopy() or held by a receive callback.
Syntax:
int (*recvfrom_zcopy_free_packets)(int s, struct xlio_recvfrom_zcopy_packet_t *pkts, size_t count);
Where:
s – socket from which the packet was received
pkts – array of packet identifiers
count – number of packets in the array
Return value:
0 on success, -1 on failure
errno is set to:
EINVAL – not an offloaded socket
ENOENT – the packet was not received from 's'.
Example:
entry | Source | Source-mask | Dest | Dest-mask | Interface | Service | Routing | Status | Log |
1 | any | any | any | any | if0 | any | tunneling | active | 1 |
2 | 192.168.2.0 | 255.255.255.0 | any | any | if1 | any | tunneling | active | 1 |
Expected result:
sRB-20210G-61f0(statistic)# log show
counter | tx total pack | tx total byte | rx total pack | rx total byte |
1 | 2733553 | 268066596 | 3698 | 362404 |
Parameter | Description |
tx total byte | The number of transmit bytes associated with a TFM rule; has a log counter n. The above example shows the number of bytes sent from Infiniband to Ethernet (one way) or sent between InfiniBand and Ethernet and matching the two TFM rules with log counter #1. |
rx total pack | The number of receive packets associated with a TFM rule; has a log counter n. |
rx total byte | The number of receive bytes associated with a TFM rule; has a log counter n. |
Dumps statistics for fd number using log_level log level.
Syntax:
int (*dump_fd_stats)(int fd, int log_level);
Parameters:
Parameter | Description |
fd | fd to dump, 0 for all open fds. |
log_level | Log level used for the dump, corresponding to the vlog_levels_t enum (vlogger.h) |
For an output example, see section Monitoring – the xlio_stats Utility.
Return values:
0 on success, -1 on failure
The “Dummy Send” feature gives the application developer the capability to send dummy packets in order to warm up the CPU caches on the XLIO send path, minimizing cache misses and improving latency. The dummy packets reach the hardware NIC and are then dropped.
The application developer is responsible for sending the dummy packets by setting the XLIO_SND_FLAGS_DUMMY bit in the flags parameter of the send(), sendto(), sendmsg(), and sendmmsg() socket API calls.
Parameters:
Parameter | Description |
XLIO_SND_FLAGS_DUMMY | Indicates a dummy packet |
Return values:
Same as the original APIs for offloaded sockets. Otherwise, -1 is returned and errno is set to EINVAL.
Usage example:
void dummyWait(Timer waitDuration, Timer dummySendCycleDuration) {
    Timer now = Timer::now();
    Timer endTime = now + waitDuration;
    Timer nextDummySendTime = now + dummySendCycleDuration;
    for ( ; now < endTime; now = Timer::now()) {
        if (now >= nextDummySendTime) {
            send(fd, buf, len, XLIO_SND_FLAGS_DUMMY);
            nextDummySendTime += dummySendCycleDuration;
        }
    }
}
This sample code sends a dummy packet every dummySendCycleDuration using the XLIO Extra API, for as long as the total elapsed time does not exceed waitDuration.
It is recommended not to send more than 50k dummy packets per second.
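The 50k packets-per-second recommendation corresponds to a minimum spacing of 1,000,000,000 ns / 50,000 = 20,000 ns between dummy sends. The following is a minimal C sketch of the same wait loop with that ceiling enforced. The names dummy_wait, clamp_cycle_ns, dummy_send_fn and count_send are hypothetical helpers introduced for illustration; in a real application the callback would wrap send(fd, buf, len, XLIO_SND_FLAGS_DUMMY).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC  UINT64_C(1000000000)
#define MAX_DUMMY_PPS UINT64_C(50000)   /* recommended ceiling from the text */

/* Hypothetical callback: the application performs the actual dummy send here. */
typedef void (*dummy_send_fn)(void *ctx);

/* Stand-in callback that only counts invocations; it sends nothing. */
static void count_send(void *ctx)
{
    (*(unsigned *)ctx)++;
}

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * NSEC_PER_SEC + (uint64_t)ts.tv_nsec;
}

/* Raise the cycle to at least 20000 ns so the rate stays at or below 50k pps. */
static uint64_t clamp_cycle_ns(uint64_t cycle_ns)
{
    const uint64_t min_cycle_ns = NSEC_PER_SEC / MAX_DUMMY_PPS;
    return cycle_ns < min_cycle_ns ? min_cycle_ns : cycle_ns;
}

/* Busy-wait for wait_ns, invoking send_dummy once per cycle. Returns the send count. */
static unsigned dummy_wait(uint64_t wait_ns, uint64_t cycle_ns,
                           dummy_send_fn send_dummy, void *ctx)
{
    assert(send_dummy != NULL);
    cycle_ns = clamp_cycle_ns(cycle_ns);

    uint64_t now = now_ns();
    const uint64_t end = now + wait_ns;
    uint64_t next = now + cycle_ns;
    unsigned sent = 0;

    for (; now < end; now = now_ns()) {
        if (now >= next) {
            send_dummy(ctx);
            sent++;
            next += cycle_ns;
        }
    }
    return sent;
}
```

Clamping the cycle rather than rejecting it keeps the caller simple: a request for a 1 µs cycle silently becomes a 20 µs cycle.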
Verifying “Dummy Send” Capability in HW
In order to verify “Dummy Send” capability in the hardware, run XLIO with DEBUG trace level.
XLIO_TRACELEVEL=DEBUG LD_PRELOAD=<path to libxlio.so> <command line>
Look in the printout for “HW Dummy send support for QP = [0|1]”.
For example:
Pid: 3832 Tid: 3832 XLIO DEBUG: qpm[0x2097310]:121:configure() Creating QP of transport type 'ETH' on ibv device 'mlx5_0' [0x201e460] on port 1
Pid: 3832 Tid: 3832 XLIO DEBUG: qpm[0x2097310]:137:configure() HW Dummy send support for QP = 1
Pid: 3832 Tid: 3832 XLIO DEBUG: cqm[0x203a460]:269:cq_mgr() Created CQ as Tx with fd[25] and of size 3000 elements (ibv_cq_hndl=0x20a0000)
“Dummy Packets” Statistics
Run xlio_stats tool to view the total amount of dummy-packets sent.
xlio_stats -p <pid> -v 3
The number of dummy messages sent will appear under the relevant fd. For example:
======================================================
Fd=[20]
- UDP, Blocked, MC Loop Enabled
- Local Address = [0.0.0.0:56732]
Tx Offload: 128 / 9413 / 0 / 0 [kilobytes/packets/drops/errors]
Tx Dummy messages : 87798
Rx Offload: 128 / 9413 / 0 / 0 [kilobytes/packets/eagains/errors]
Rx byte: cur 0 / max 14 / dropped 0 / limit 212992
Rx pkt : cur 0 / max 1 / dropped 0
Rx poll: 0 / 9411 (100.00%) [miss/hit]
======================================================
This function allows the application to communicate with the library using an extendable protocol based on struct cmsghdr.
Syntax:
int (*ioctl)(void *cmsg_hdr, size_t cmsg_len);
Parameters:
Parameter | Description |
cmsg_hdr | The address of the ancillary data. |
cmsg_len | The length of the ancillary data passed in cmsg_hdr. Note that if multiple ancillary data sections are being passed, this length should reflect the total length of all ancillary data sections. |
The cmsg_hdr parameter points to the ancillary data. This cmsg_hdr pointer points to the following structure (C/C++ example shown) that describes the ancillary data.
struct cmsghdr {
    size_t cmsg_len;   /* data byte count includes hdr */
    int    cmsg_level; /* originating protocol */
    int    cmsg_type;  /* protocol-specific type */
    /* followed by u_char cmsg_data[]; */
};
Ancillary data is a sequence of cmsghdr structures with appended data. The sequence of cmsghdr structures should never be accessed directly. Instead, use only the following macros: CMSG_ALIGN, CMSG_SPACE, CMSG_DATA, CMSG_LEN.
Guidelines:
The cmsg_len should be set to the length of the cmsghdr plus the length of all ancillary data that follows immediately after the cmsghdr. This is represented by the commented out cmsg_data field.
The cmsg_level should be set to the option level (for example, SOL_SOCKET).
The cmsg_type should be set to the option name (for example, CMSG_XLIO_IOCTL_USER_ALLOC).
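The guidelines above can be sketched as a small helper that fills a single ancillary-data section using only the standard macros. This is a minimal sketch, not part of the XLIO API: build_cmsg is a hypothetical helper name, and the caller is assumed to supply a buffer that is suitably aligned for struct cmsghdr and the real level and type values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Fill buf with one cmsghdr followed by payload.
 * Returns the cmsg_len that was written, or 0 if buf is too small.
 * buf must be aligned for struct cmsghdr (for example, via a union).
 */
static size_t build_cmsg(void *buf, size_t buf_len, int level, int type,
                         const void *payload, size_t payload_len)
{
    assert(buf != NULL && payload != NULL);

    /* CMSG_SPACE covers the header, the payload, and trailing padding. */
    if (buf_len < CMSG_SPACE(payload_len)) {
        return 0;
    }
    memset(buf, 0, buf_len);

    struct cmsghdr *cmsg = (struct cmsghdr *)buf;
    cmsg->cmsg_level = level;
    cmsg->cmsg_type = type;
    /* CMSG_LEN covers the header plus the payload, without trailing padding. */
    cmsg->cmsg_len = CMSG_LEN(payload_len);
    /* CMSG_DATA yields the payload location; never compute it by hand. */
    memcpy(CMSG_DATA(cmsg), payload, payload_len);

    return (size_t)cmsg->cmsg_len;
}
```

The returned length is the value to pass as the cmsg_len argument when a single section is sent.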
Supported commands:
Command | Description |
CMSG_XLIO_IOCTL_USER_ALLOC | Use user defined function to allocate global pools |
CMSG_XLIO_IOCTL_USER_ALLOC
Field size | Description |
uint8_t | control flags |
uintptr_t | pointer to memory allocation function |
uintptr_t | pointer to memory free function |
Control Flags
enum {
    IOCTL_USER_ALLOC_TX    = (1 << 0),
    IOCTL_USER_ALLOC_RX    = (1 << 1),
    IOCTL_USER_ALLOC_TX_ZC = (1 << 2)
};
Usage Example
In this example, the application uses the CMSG_XLIO_IOCTL_USER_ALLOC command.
#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <netdb.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <arpa/inet.h>
#include <mellanox/xlio_extra.h>

/* Allocate a page-aligned block whose size is rounded up to whole pages. */
void *my_alloc(size_t sz_bytes)
{
    void *m_data_block = NULL;
    long page_size = sysconf(_SC_PAGESIZE);

    if (page_size > 0) {
        sz_bytes = (sz_bytes + page_size - 1) & ~((size_t)page_size - 1);
        /* posix_memalign() returns 0 on success, an error number on failure. */
        int ret = posix_memalign(&m_data_block, page_size, sz_bytes);
        if (ret != 0) {
            return NULL;
        }
    }
    return m_data_block;
}

void my_free(void *ptr)
{
    free(ptr);
}

int main(int argc, char *argv[])
{
    int sockfd = 0, n = 0;
    char recvBuff[1024];
    struct sockaddr_in serv_addr;

    if (argc != 2) {
        printf("\n Usage: %s <ip of server> \n", argv[0]);
        return 1;
    }

    /* Payload layout: control flags, then alloc and free function pointers. */
#pragma pack(push, 1)
    struct {
        uint8_t flags;
        void *(*alloc_func)(size_t);
        void (*free_func)(void *);
    } data;
#pragma pack(pop)

    struct cmsghdr *cmsg;
    char cbuf[CMSG_SPACE(sizeof(data))];

    errno = 0;
    cmsg = (struct cmsghdr *)cbuf;
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = CMSG_XLIO_IOCTL_USER_ALLOC;
    cmsg->cmsg_len = CMSG_LEN(sizeof(data));
    data.flags = 0x03; /* IOCTL_USER_ALLOC_TX | IOCTL_USER_ALLOC_RX */
    data.alloc_func = my_alloc;
    data.free_func = my_free;
    memcpy(CMSG_DATA(cmsg), &data, sizeof(data));

    struct xlio_api_t *extra_api;
    extra_api = xlio_get_api();
    printf("extra_api=%p\n", (void *)extra_api);

    int rc = 0;
    if (extra_api) {
        rc = extra_api->ioctl(cmsg, cmsg->cmsg_len);
    }
    printf("extra_api->ioctl() rc=%d\n", rc);

    memset(recvBuff, 0, sizeof(recvBuff));
    if ((sockfd = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
        printf("\n Error : Could not create socket \n");
        return 1;
    }

    memset(&serv_addr, 0, sizeof(serv_addr));
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_port = htons(5000);

    if (inet_pton(AF_INET, argv[1], &serv_addr.sin_addr) <= 0) {
        printf("\n inet_pton error occurred\n");
        return 1;
    }

    if (connect(sockfd, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < 0) {
        printf("\n Error : Connect Failed \n");
        return 1;
    }

    while ((n = read(sockfd, recvBuff, sizeof(recvBuff) - 1)) > 0) {
        recvBuff[n] = 0;
        if (fputs(recvBuff, stdout) == EOF) {
            printf("\n Error : Fputs error\n");
        }
    }

    if (n < 0) {
        printf("\n Read error \n");
    }

    return 0;
}