Date: 2025-08-25

XLIO Socket API is an event-based API for high-performance scenarios. It is a non-standard API, so the application must be integrated with it explicitly.

XLIO Socket API triggers a callback immediately when the respective event happens. This reduces latency and simplifies event handling. The API also lets the application avoid event aggregation when it is unnecessary.

There are two ways to call the API:

Direct function calls: The prototypes are declared in <mellanox/xlio.h>. This approach requires explicit linkage with the XLIO static library.

Indirect function calls: Made through the pointers provided by xlio_get_api(). The prototypes are declared in <mellanox/xlio_extra.h>.

Common types are defined in <mellanox/xlio_types.h>, which is included implicitly by the above headers.

Current limitations:

Only TCP sockets are supported.

Only polling mode is supported.

Listen sockets are not supported.

For a sample application, please refer to tests/extra_api/xlio_socket_api.c within the XLIO sources.

XLIO Socket API requires explicit global initialization before using any other functions. The initialization is a heavy process and is expected to be performed in advance.

Types definitions

    struct xlio_init_attr {
        unsigned flags;
        xlio_memory_cb_t memory_cb;
        void *(*memory_alloc)(size_t);
        void (*memory_free)(void *);
    };

Where

Field - Description

flags - Global flags. Currently unused.

memory_cb - An optional callback called when XLIO allocates memory for data buffers. Zerocopy RX buffers point only to such memory. The user can use this information to prepare the allocated memory for further processing of the zerocopy RX data.

memory_alloc, memory_free - An optional external allocator to be used for the data buffers. The external allocator and memory_cb are orthogonal and may be used together.

Syntax

Global initialization

    int xlio_init_ex(const struct xlio_init_attr *attr);

Where

Argument - Description

attr - Global attributes

Return value

Returns 0 on success. On error, -1 is returned, and errno is set to indicate the error.

Note: The user should finalize the XLIO library with xlio_exit() when it is no longer needed. Usually, this is done during the termination phase. Neither the XLIO Socket API nor the intercepted POSIX API may be used after the finalization.

Syntax

Global finalization

    xlio_exit();
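As a rough illustration of the initialization attributes, the sketch below fills xlio_init_attr with a page-aligned external allocator. The struct is re-declared locally from the definition above so that the sketch compiles without the XLIO headers. The memory_cb signature, the 4096-byte alignment, and the helper names app_alloc, app_free, and app_init_attr are assumptions made for illustration only.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* for posix_memalign() */
#endif
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in declarations mirroring the definitions above; a real application
 * gets them from <mellanox/xlio.h>. The memory_cb signature is an assumption. */
typedef void (*xlio_memory_cb_t)(void *addr, size_t len, size_t hugepage_size);

struct xlio_init_attr {
    unsigned flags;
    xlio_memory_cb_t memory_cb;
    void *(*memory_alloc)(size_t);
    void (*memory_free)(void *);
};

/* Hypothetical external allocator: page-aligned memory for data buffers. */
static void *app_alloc(size_t size)
{
    void *ptr = NULL;
    if (posix_memalign(&ptr, 4096, size) != 0) {
        return NULL;
    }
    return ptr;
}

static void app_free(void *ptr)
{
    free(ptr);
}

/* Builds attributes with no flags, no memory callback, and the external
 * allocator installed. */
static struct xlio_init_attr app_init_attr(void)
{
    struct xlio_init_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.memory_alloc = app_alloc;
    attr.memory_free = app_free;
    return attr;
}
```

In a real application the result would presumably be passed to xlio_init_ex() once at startup, with xlio_exit() called during termination.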





An XLIO polling group is a collection of XLIO sockets and their internal auxiliary objects. An XLIO polling group is represented by the opaque xlio_poll_group_t type.

Note: Polling groups do not share objects. Thus, object migration between groups is not supported.

Operations with different polling groups do not overlap, except for unlikely protected access to global pools. Therefore, multiple polling groups can work in parallel without serialization.

Operations with the same polling group must be serialized.

Polling groups are not bound to a CPU or thread. A single polling group may be used on multiple CPUs if serialization is guaranteed. For example, this approach can be used to implement polling group migration.
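One possible way to guarantee serialization when several threads may drive the same polling group is a try-lock around every group operation. The following is a minimal sketch using a C11 atomic flag. The group_guard type and all helper names are hypothetical, and poll_fn stands in for a call such as xlio_poll_group_poll() on the guarded group.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical guard that serializes access to one polling group. */
struct group_guard {
    atomic_flag busy;
};

static void group_guard_init(struct group_guard *g)
{
    atomic_flag_clear(&g->busy);
}

/* Returns true if the caller now owns the group. */
static bool group_guard_try_enter(struct group_guard *g)
{
    return !atomic_flag_test_and_set_explicit(&g->busy, memory_order_acquire);
}

static void group_guard_leave(struct group_guard *g)
{
    atomic_flag_clear_explicit(&g->busy, memory_order_release);
}

/* Runs poll_fn(ctx) only if no other thread is inside the group.
 * Returns false without polling when the group is busy. */
static bool group_poll_serialized(struct group_guard *g,
                                  void (*poll_fn)(void *), void *ctx)
{
    if (!group_guard_try_enter(g)) {
        return false;
    }
    poll_fn(ctx);
    group_guard_leave(g);
    return true;
}

/* Example stand-in for the real poll call: counts invocations. */
static void count_poll(void *ctx)
{
    ++*(int *)ctx;
}
```

A thread that finds the group busy simply skips the poll and retries later, which avoids blocking inside the polling loop.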

Recommendations:

Polling groups are expected to be long-lived objects.

The expected usage is one polling group per CPU/thread, plus possibly a small number of extra groups.

Each polling group creates HW objects per utilized network interface. Minimizing the number of utilized network interfaces per group improves HW resource utilization.

A major part of the XLIO activity is done in the context of the xlio_poll_group_poll() call. Therefore, this function should be called frequently enough to reduce latency and avoid runtime issues such as timeouts and TCP retransmissions.

Flags definitions

    #define XLIO_GROUP_FLAG_SAFE 0x1
    #define XLIO_GROUP_FLAG_DIRTY 0x2

Where

Flag - Description

XLIO_GROUP_FLAG_SAFE - Relaxes thread-safety requirements: allows a send operation to be called concurrently with polling group operations. However, all group operations and socket creation/destruction must still be serialized, and concurrent send operations must still be serialized with each other. This flag has a runtime cost and is intended for sockets that are not performance-critical.

XLIO_GROUP_FLAG_DIRTY - Requests the group to track dirty sockets. Required for xlio_poll_group_flush() to function.

Types definitions

    struct xlio_poll_group_attr {
        unsigned flags;
        void (*socket_event_cb)(xlio_socket_t, uintptr_t userdata_sq, int event, int value);
        void (*socket_comp_cb)(xlio_socket_t, uintptr_t userdata_sq, uintptr_t userdata_op);
        void (*socket_rx_cb)(xlio_socket_t, uintptr_t userdata_sq, void *data, size_t len, struct xlio_buf *buf);
    };

Where

Field - Description

flags - Polling group flags

socket_event_cb - Mandatory callback for socket events

socket_comp_cb - Completion callback for zerocopy send operations

socket_rx_cb - Callback for RX data delivery

Syntax

Creating XLIO polling group

    int xlio_poll_group_create(const struct xlio_poll_group_attr *attr, xlio_poll_group_t *group_out);

Where

Argument - Description

attr - Polling group attributes

group_out - On success, the created polling group is saved there

Return value

Returns 0 on success. On error, -1 is returned, and errno is set to indicate the error.

Syntax

XLIO polling group destruction

    int xlio_poll_group_destroy(xlio_poll_group_t group);

Where

Argument - Description

group - XLIO polling group

Return value

Returns 0 on success. On error, -1 is returned, and errno is set to indicate the error.

Syntax

Polling

    void xlio_poll_group_poll(xlio_poll_group_t group);

Where

Argument - Description

group - XLIO polling group
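To make the polling group attributes more concrete, the sketch below fills xlio_poll_group_attr with three application callbacks and an optional XLIO_GROUP_FLAG_DIRTY flag. The declarations are copied from the definitions above so that the sketch compiles without the XLIO headers. The uintptr_t typedef for xlio_socket_t is an assumption (the real type is opaque), and on_event, on_comp, on_rx, and app_group_attr are hypothetical names with placeholder bodies.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in declarations; a real application includes <mellanox/xlio.h>. */
typedef uintptr_t xlio_socket_t; /* assumption: the real type is opaque */
struct xlio_buf;

#define XLIO_GROUP_FLAG_SAFE 0x1
#define XLIO_GROUP_FLAG_DIRTY 0x2

struct xlio_poll_group_attr {
    unsigned flags;
    void (*socket_event_cb)(xlio_socket_t, uintptr_t userdata_sq, int event, int value);
    void (*socket_comp_cb)(xlio_socket_t, uintptr_t userdata_sq, uintptr_t userdata_op);
    void (*socket_rx_cb)(xlio_socket_t, uintptr_t userdata_sq, void *data, size_t len,
                         struct xlio_buf *buf);
};

/* Hypothetical application callbacks with placeholder bodies. */
static void on_event(xlio_socket_t sock, uintptr_t userdata_sq, int event, int value)
{
    (void)sock; (void)userdata_sq; (void)event; (void)value;
}

static void on_comp(xlio_socket_t sock, uintptr_t userdata_sq, uintptr_t userdata_op)
{
    (void)sock; (void)userdata_sq; (void)userdata_op;
}

static void on_rx(xlio_socket_t sock, uintptr_t userdata_sq, void *data, size_t len,
                  struct xlio_buf *buf)
{
    (void)sock; (void)userdata_sq; (void)data; (void)len; (void)buf;
}

/* Builds group attributes. Dirty-socket tracking is enabled only when the
 * application intends to use the group level flush. */
static struct xlio_poll_group_attr app_group_attr(bool track_dirty)
{
    struct xlio_poll_group_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.flags = track_dirty ? XLIO_GROUP_FLAG_DIRTY : 0;
    attr.socket_event_cb = on_event; /* mandatory */
    attr.socket_comp_cb = on_comp;
    attr.socket_rx_cb = on_rx;
    return attr;
}
```

The attributes would then be passed to xlio_poll_group_create(), after which the owning thread is expected to call xlio_poll_group_poll() in its main loop.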

An XLIO socket is similar to a POSIX socket, except that it has a separate, non-overlapping API. An XLIO socket is represented by the opaque xlio_socket_t type.

XLIO sockets have the following properties:

Always non-blocking.

No partial write support. Either all the data is accepted, or the call fails.

Types definitions

    struct xlio_socket_attr {
        unsigned flags;
        int domain;
        xlio_poll_group_t group;
        uintptr_t userdata_sq;
    };

Where

Field - Description

flags - Socket flags, currently unused

domain - Address family: either AF_INET or AF_INET6

group - XLIO polling group

userdata_sq - Opaque per-socket userdata

Syntax

XLIO socket creation

    int xlio_socket_create(const struct xlio_socket_attr *attr, xlio_socket_t *sock_out);

Where

Argument - Description

attr - Socket attributes

sock_out - On success, the created socket object is saved there

Return value

Returns 0 on success. On error, -1 is returned, and errno is set to indicate the error.

Syntax

XLIO socket destruction

    int xlio_socket_destroy(xlio_socket_t sock);

Where

Argument - Description

sock - XLIO socket object

Return value

Returns 0 on success. On error, -1 is returned, and errno is set to indicate the error.

Syntax

Connect XLIO socket

    int xlio_socket_connect(xlio_socket_t sock, const struct sockaddr *to, socklen_t tolen);

Where

Argument - Description

sock - XLIO socket object

to - Remote address to connect to

tolen - Length of the address object

Return value

Returns 0 on success. On error, -1 is returned, and errno is set to indicate the error. An asynchronous connect is reported as success; therefore, EINPROGRESS and EAGAIN errors are not possible. The result of an asynchronous connect is delivered through the socket event callback. Subsequent xlio_socket_connect() calls are ignored, and their return code is undefined.
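The to argument is an ordinary POSIX socket address. As a small illustration, the helper below prepares an IPv4 address with standard calls only; make_ipv4_addr is a hypothetical name, not part of XLIO.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Fills an IPv4 socket address suitable for the `to` argument.
 * Returns 0 on success, -1 if `ip` is not a valid dotted-quad string. */
static int make_ipv4_addr(const char *ip, uint16_t port, struct sockaddr_in *out)
{
    memset(out, 0, sizeof(*out));
    out->sin_family = AF_INET;
    out->sin_port = htons(port);
    return inet_pton(AF_INET, ip, &out->sin_addr) == 1 ? 0 : -1;
}
```

The filled structure would be passed as (const struct sockaddr *)&addr with sizeof(addr) as tolen. A return value of 0 from xlio_socket_connect() only means the connect was started; the outcome arrives later through the socket event callback.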

xlio_socket_setsockopt() and xlio_socket_bind() duplicate setsockopt(2) and bind(2) functionality respectively.

Syntax

setsockopt and bind

    int xlio_socket_setsockopt(xlio_socket_t sock, int level, int optname, const void *optval, socklen_t optlen);
    int xlio_socket_bind(xlio_socket_t sock, const struct sockaddr *addr, socklen_t addrlen);

XLIO exposes the protection domain as an ibv_pd object. The protection domain is related to the outgoing device used by the socket. One protection domain is expected per outgoing interface; as a result, sockets can share the same object, depending on the remote IP address configuration.

xlio_socket_get_pd() should be called after XLIO determines the outgoing device for the socket, which happens in the context of xlio_socket_connect().

The main purpose of the exposed protection domain is to perform memory registration for user’s TX data buffers which will be used in the TX zerocopy path. See ibv_reg_mr(3) and “TX Data Path” for details. See XLIO Socket sample application for an example.

Syntax

Protection domain

    struct ibv_pd *xlio_socket_get_pd(xlio_socket_t sock);

Where

Argument - Description

sock - XLIO socket object

Return value

Returns the protection domain for the socket on success. On error, NULL is returned. The function fails until xlio_socket_connect() is called for the respective socket.

The socket event callback is configured per polling group with xlio_poll_group_attr::socket_event_cb.

Most of the socket events are delivered from xlio_poll_group_poll() context, except for XLIO_SOCKET_EVENT_TERMINATED, which can be triggered from the xlio_socket_destroy() context.

The socket event callback imposes the following restrictions on socket operations:

Send operations are allowed only while processing the XLIO_SOCKET_EVENT_ESTABLISHED event.

xlio_socket_destroy() is not allowed.

Syntax

Socket event callback

    enum {
        XLIO_SOCKET_EVENT_ESTABLISHED = 1,
        XLIO_SOCKET_EVENT_TERMINATED,
        XLIO_SOCKET_EVENT_CLOSED,
        XLIO_SOCKET_EVENT_ERROR,
    };

    typedef void (*xlio_socket_event_cb_t)(xlio_socket_t sock, uintptr_t userdata_sq, int event, int value);

Where

Argument - Description

sock - XLIO socket object

userdata_sq - Opaque user data which is defined during socket creation

event - Represents the event

value - Holds a POSIX error code for the XLIO_SOCKET_EVENT_ERROR event. Should be ignored for other events

Possible error codes for the XLIO_SOCKET_EVENT_ERROR event:

ECONNABORTED - connection aborted by the local side

ECONNRESET - connection reset by the remote side

ECONNREFUSED - connection refused by the remote side during TCP handshake

ETIMEDOUT - connection timed out due to keepalive, user timeout option or TCP handshake timeout
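Because xlio_socket_destroy() is not allowed inside the event callback, one plausible pattern is to record the event in per-socket state and defer teardown to the main loop. The sketch below assumes that userdata_sq carries a pointer to an application struct conn. The struct, its states, and the choice to schedule teardown on CLOSED and ERROR are assumptions for illustration; xlio_socket_t is stubbed as uintptr_t so that the sketch compiles without the XLIO headers.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t xlio_socket_t; /* stand-in; the real type is opaque */

enum {
    XLIO_SOCKET_EVENT_ESTABLISHED = 1,
    XLIO_SOCKET_EVENT_TERMINATED,
    XLIO_SOCKET_EVENT_CLOSED,
    XLIO_SOCKET_EVENT_ERROR,
};

/* Hypothetical per-connection application state. */
enum conn_state { CONN_CONNECTING, CONN_READY, CONN_CLOSED, CONN_FAILED, CONN_TERMINATED };

struct conn {
    enum conn_state state;
    int last_error;        /* POSIX error code, valid in CONN_FAILED */
    bool destroy_pending;  /* main loop should destroy the socket */
};

/* Event callback: only updates state, never destroys the socket here. */
static void on_socket_event(xlio_socket_t sock, uintptr_t userdata_sq, int event, int value)
{
    struct conn *c = (struct conn *)userdata_sq;
    (void)sock;

    switch (event) {
    case XLIO_SOCKET_EVENT_ESTABLISHED:
        c->state = CONN_READY; /* send operations are allowed from here */
        break;
    case XLIO_SOCKET_EVENT_CLOSED:
        c->state = CONN_CLOSED;
        c->destroy_pending = true;
        break;
    case XLIO_SOCKET_EVENT_ERROR:
        c->state = CONN_FAILED;
        c->last_error = value; /* e.g. ECONNRESET, ETIMEDOUT */
        c->destroy_pending = true;
        break;
    case XLIO_SOCKET_EVENT_TERMINATED:
        c->state = CONN_TERMINATED;
        c->destroy_pending = false;
        break;
    default:
        break;
    }
}
```

After xlio_poll_group_poll() returns, the main loop can walk its connections and call xlio_socket_destroy() for those with destroy_pending set.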

The TX path aggregates data until the user requests a flush. This removes the need for data aggregation at the user level and gives explicit control over sending larger, more efficient packets. There are three ways to flush sockets:

Polling group level flush with xlio_poll_group_flush()

Socket level flush with xlio_socket_flush()

Socket level flush with the XLIO_SEND_FLAG_FLUSH flag in a send operation

For polling groups created with the XLIO_GROUP_FLAG_DIRTY flag, it is recommended to use only the group level flush. For sockets from a group without the flag, use the socket level flush.

The Nagle algorithm remains in effect for XLIO sockets. However, it is recommended to use the explicit flush mechanism and disable the Nagle algorithm with either the TCP_NODELAY option or the XLIO_TCP_NODELAY parameter.

By default, send operations are zerocopy. The memory with data must be registered in advance in the XLIO protection domain. See xlio_socket_get_pd() and ibv_reg_mr(3).

The XLIO_SEND_FLAG_INLINE flag forces XLIO to copy data to its internal buffers. An inline send operation does not take ownership of the data memory, and the respective buffers may be reused immediately after the call returns. Such an operation ignores the xlio_send_attr::mkey and xlio_send_attr::userdata_op fields.

There is no partial write, and the TCP send buffer option does not affect the XLIO sockets. XLIO either queues all the data or returns an error. Errors are not recoverable.

Flags definitions

    #define XLIO_SEND_FLAG_FLUSH 0x1
    #define XLIO_SEND_FLAG_INLINE 0x2

Where

Flag - Description

XLIO_SEND_FLAG_FLUSH - Flush all aggregated data as part of the send operation

XLIO_SEND_FLAG_INLINE - Force XLIO to copy the data to its internal buffers

Types definitions

    struct xlio_send_attr {
        unsigned flags;
        uint32_t mkey;
        uintptr_t userdata_op;
    };

Where

Field - Description

flags - Send operation flags (XLIO_SEND_FLAG_FLUSH, XLIO_SEND_FLAG_INLINE)

mkey - Memory registration key (e.g. obtained via ibv_reg_mr(3))

userdata_op - Opaque per-operation userdata

Syntax

Send operation

    int xlio_socket_send(xlio_socket_t sock, void *data, size_t len, struct xlio_send_attr *attr);
    int xlio_socket_sendv(xlio_socket_t sock, struct iovec *iov, unsigned iovcnt, struct xlio_send_attr *attr);

Where

Argument - Description

sock - XLIO socket object

data - User pointer to the data to send

len - Length of the data

attr - Send operation attributes

iov - Vectorized data to send

iovcnt - Number of scatter-gather elements in the iov vector

Return value

Returns 0 on success. On error, -1 is returned, and errno is set to indicate the error.
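The interaction of the send flags with mkey and userdata_op can be sketched as a small helper that prepares xlio_send_attr for one operation. The declarations are copied from the definitions above so that the sketch compiles without the XLIO headers. The APP_INLINE_MAX threshold and the policy of sending small messages inline are assumptions for illustration, not XLIO recommendations.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in declarations; a real application includes <mellanox/xlio.h>. */
#define XLIO_SEND_FLAG_FLUSH 0x1
#define XLIO_SEND_FLAG_INLINE 0x2

struct xlio_send_attr {
    unsigned flags;
    uint32_t mkey;
    uintptr_t userdata_op;
};

#define APP_INLINE_MAX 256 /* hypothetical copy-vs-zerocopy threshold */

/* Prepares attributes for one send operation:
 *  - small messages are copied (inline), so mkey and userdata_op stay unset
 *    because inline operations ignore them;
 *  - larger messages are zerocopy and carry the registration key plus a
 *    completion cookie;
 *  - the last message of a batch also flushes the aggregated data. */
static struct xlio_send_attr app_send_attr(size_t len, bool last_in_batch,
                                           uint32_t mkey, uintptr_t userdata_op)
{
    struct xlio_send_attr attr;
    memset(&attr, 0, sizeof(attr));
    if (len <= APP_INLINE_MAX) {
        attr.flags |= XLIO_SEND_FLAG_INLINE;
    } else {
        attr.mkey = mkey;
        attr.userdata_op = userdata_op;
    }
    if (last_in_batch) {
        attr.flags |= XLIO_SEND_FLAG_FLUSH;
    }
    return attr;
}
```

Since errors from the send functions are not recoverable, a caller would typically treat a -1 return as a reason to tear the connection down rather than retry.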

Syntax

Flush operation

    void xlio_socket_flush(xlio_socket_t sock);
    void xlio_poll_group_flush(xlio_poll_group_t group);

Where

Argument - Description

sock - XLIO socket object

group - XLIO polling group object

User can request a completion on individual zerocopy send operations. A completion is requested with a non-zero xlio_send_attr::userdata_op value. Zero value in xlio_send_attr::userdata_op disables the completion for the operation. With the completion, XLIO guarantees the following:

The respective data is delivered to the remote side

The data is acknowledged by the TCP protocol

The memory buffer is not used by XLIO

A completion is generated for an operation rather than a buffer. On a completion, user may reuse the respective memory buffers.

XLIO does not guarantee order of the completions. However, completions are likely generated in the same order as their respective send operations.

The user may provide the same xlio_send_attr::userdata_op value in multiple send operations; XLIO then generates multiple completions, each carrying that userdata_op value.

XLIO_SEND_FLAG_INLINE send operations do not generate completions.

Syntax

Zerocopy completion callback

    void (*socket_comp_cb)(xlio_socket_t sock, uintptr_t userdata_sq, uintptr_t userdata_op);

Where

Argument - Description

sock - XLIO socket object

userdata_sq - Opaque per-socket userdata

userdata_op - Opaque per-operation userdata
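Since completions are per operation and their order is not guaranteed, a simple way to track buffer ownership is to pass a pointer to the buffer's bookkeeping slot as userdata_op and release exactly that slot in the callback. The sketch below assumes a small fixed pool; tx_slot, tx_pool, and the helper names are hypothetical, and xlio_socket_t is stubbed as uintptr_t so that the sketch compiles without the XLIO headers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t xlio_socket_t; /* stand-in; the real type is opaque */

#define TX_SLOTS 4 /* hypothetical pool size */

/* One TX buffer plus an ownership marker. In a real application the pool
 * memory would be registered in the XLIO protection domain beforehand. */
struct tx_slot {
    char data[2048];
    bool in_flight; /* true while XLIO may still use the buffer */
};

struct tx_pool {
    struct tx_slot slots[TX_SLOTS];
};

/* Returns a free slot marked as in flight, or NULL if XLIO owns them all. */
static struct tx_slot *tx_pool_get(struct tx_pool *pool)
{
    for (size_t i = 0; i < TX_SLOTS; ++i) {
        if (!pool->slots[i].in_flight) {
            pool->slots[i].in_flight = true;
            return &pool->slots[i];
        }
    }
    return NULL;
}

/* Completion callback: userdata_op carries the slot pointer, which is never
 * zero, so a completion is requested for every such operation. No ordering
 * between completions is assumed. */
static void on_send_complete(xlio_socket_t sock, uintptr_t userdata_sq, uintptr_t userdata_op)
{
    struct tx_slot *slot = (struct tx_slot *)userdata_op;
    (void)sock;
    (void)userdata_sq;
    slot->in_flight = false;
}
```

With this layout the send path would set xlio_send_attr::userdata_op to (uintptr_t)slot, and the slot becomes reusable as soon as its own completion arrives, regardless of what happens to other slots.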

RX payload is delivered with the RX callback and treated as an RX event. There is no data aggregation on the socket layer and data is delivered immediately. However, orthogonal features LRO and GRO can perform aggregation on the lower layers, which can affect latency and data granularity.

The RX path is always zerocopy: XLIO provides a pointer to its internal buffer, which is in the user address space. Once the RX buffer is handled, the user is responsible for returning the buffer to XLIO.

The user can use an external allocator and/or notification about RX buffers memory allocation to control the memory area, which is used in the RX path. See “Global initialization” section above for details. If needed, the memory area may be prepared in advance for further handling by the application (e.g. register memory for RDMA operations).

XLIO provides an xlio_buf metadata object which defines xlio_buf::userdata. The field consists of 8 uninitialized bytes that the user can use while owning the buffer. The user owns a buffer from the respective RX callback until the buffer is returned to XLIO.

Syntax

RX data callback

    void (*socket_rx_cb)(xlio_socket_t sock, uintptr_t userdata_sq, void *data, size_t len, struct xlio_buf *buf);

Where

Argument - Description

sock - XLIO socket object

userdata_sq - Opaque per-socket userdata

data - Pointer to the payload which points to an XLIO internal buffer

len - Data length

buf - A buffer metadata object which must be returned back to XLIO

Syntax

Return RX buffer

    void xlio_socket_buf_free(xlio_socket_t sock, struct xlio_buf *buf);
    void xlio_poll_group_buf_free(xlio_poll_group_t group, struct xlio_buf *buf);
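The ownership rule can be illustrated with an RX callback that keeps buffers for later and returns them in a batch. The sketch below chains held buffers through xlio_buf::userdata, using the 8 user bytes as a next pointer. The struct xlio_buf layout shown here is a stand-in based on the description above, and rx_conn, rx_release_all, and count_release are hypothetical names; release stands in for a call such as xlio_socket_buf_free(). The sketch only counts bytes, so an application that processes the payload later would also need to record data and len.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in declarations; a real application includes <mellanox/xlio.h>. */
typedef uintptr_t xlio_socket_t;
struct xlio_buf {
    uint64_t userdata; /* assumption: 8 user-owned bytes, as described above */
};

/* Hypothetical per-connection RX state with a list of held buffers. */
struct rx_conn {
    struct xlio_buf *held_head;
    struct xlio_buf *held_tail;
    size_t bytes_received;
};

/* RX callback: takes ownership of buf and appends it to the held list. */
static void on_rx(xlio_socket_t sock, uintptr_t userdata_sq, void *data, size_t len,
                  struct xlio_buf *buf)
{
    struct rx_conn *c = (struct rx_conn *)userdata_sq;
    (void)sock;
    (void)data;

    c->bytes_received += len;
    buf->userdata = 0; /* userdata is uninitialized, so terminate the list */
    if (c->held_tail != NULL) {
        c->held_tail->userdata = (uint64_t)(uintptr_t)buf;
    } else {
        c->held_head = buf;
    }
    c->held_tail = buf;
}

/* Returns every held buffer through release() and reports how many. */
static size_t rx_release_all(struct rx_conn *c, void (*release)(struct xlio_buf *))
{
    size_t count = 0;
    struct xlio_buf *buf = c->held_head;
    while (buf != NULL) {
        /* Read the link before giving the buffer back. */
        struct xlio_buf *next = (struct xlio_buf *)(uintptr_t)buf->userdata;
        release(buf);
        buf = next;
        ++count;
    }
    c->held_head = NULL;
    c->held_tail = NULL;
    return count;
}

/* Example stand-in for the real buffer return call: counts releases. */
static size_t released_count;
static void count_release(struct xlio_buf *buf)
{
    (void)buf;
    ++released_count;
}
```

Holding buffers for long periods keeps XLIO memory pinned on the application side, so a design like this would normally release the list soon after the poll call returns.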

