Advanced Features
DPCP provides a unified, flexible interface for programming NVIDIA NICs. DPCP is a prerequisite for enabling XLIO features such as LRO offload, Striding-RQ and TLS HW offload.
XLIO, which comes as part of DOCA-Host, uses DPCP. Please check the Important Notes under Changes and New Features for the minimal DPCP version required.
DPCP is an open source project - see its repository here.
The TLS HW offload feature accelerates TLS encryption/decryption.
Prerequisites
Please refer to System Requirements
The card must be crypto enabled based on supported cards
Linux distribution with kTLS support
To use LIBXLIO with TLS 1.3 hardware offload, your OS must support kTLS 1.3.
Application or TLS library with kTLS support
OpenSSL
OpenSSL library for symmetric encryption SW fallback.
XLIO TLS1.2/1.3 Tx offload requires OpenSSL ≥ 3.0.0
XLIO TLS1.2 Rx offload requires OpenSSL ≥ 3.0.2
XLIO TLS1.3 Rx offload requires OpenSSL ≥ 3.2.0
Set enable-ktls in the configuration when building OpenSSL.
Usage
Enable kTLS in your application the same way you would when relying on kernel TLS support.
XLIO offloads Linux kTLS API. Refer to Linux documentation for description of the kTLS API: https://www.kernel.org/doc/html/latest/networking/tls.html.
TLS HW offload can be provided to application implicitly by a TLS library with kTLS support such as OpenSSL.
XLIO provides its own configuration parameters to control kTLS offload: XLIO_UTLS_TX (enabled by default) and XLIO_UTLS_RX (disabled by default).
Make sure to enable XLIO_UTLS_RX if you need kTLS on RX as well.
Note: If TLS HW offload cannot be provided, the setsockopt() syscall returns an error with errno=ENOPROTOOPT.
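The usage steps above can be sketched in C as follows. This is a minimal sketch built on the standard Linux kTLS socket options, not code taken from XLIO. The names ktls_enable_tx and enum ktls_result are hypothetical, the key material in ci is assumed to come from a completed TLS 1.2 AES128-GCM handshake, and treating ENOPROTOOPT from either call as "offload not available" is an assumption based on the note above.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/tls.h>

/* Fallbacks for older libc headers (values from the Linux UAPI). */
#ifndef TCP_ULP
#define TCP_ULP 31
#endif
#ifndef SOL_TLS
#define SOL_TLS 282
#endif

/* Hypothetical result codes used only by this sketch. */
enum ktls_result { KTLS_OK = 0, KTLS_UNAVAILABLE = 1, KTLS_ERROR = -1 };

/* Enable kTLS TX on an established TCP socket.
 * ci must hold the negotiated TLS 1.2 AES128-GCM key, IV, salt and record sequence. */
int ktls_enable_tx(int fd, const struct tls12_crypto_info_aes_gcm_128 *ci)
{
    assert(ci != NULL);

    /* Attach the "tls" upper layer protocol to the socket. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
        return errno == ENOPROTOOPT ? KTLS_UNAVAILABLE : KTLS_ERROR;

    /* Hand the TX crypto parameters to the kTLS layer. */
    if (setsockopt(fd, SOL_TLS, TLS_TX, ci, sizeof(*ci)) < 0)
        return errno == ENOPROTOOPT ? KTLS_UNAVAILABLE : KTLS_ERROR;

    return KTLS_OK;
}
```

A caller that receives KTLS_UNAVAILABLE can keep the connection and continue with software encryption in its TLS library.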
Monitoring
The TLS HW offload feature adds new statistics counters. Their presence indicates that offload is configured and works. The xlio_stats tool with the -v3 option shows TLS statistics for TCP sockets and rings:
======================================================
Fd=[59]
- TCP, Non-blocked
- Local Address = [14.212.1.34:443]
- Foreign Address = [14.212.1.57:49072]
Tx Offload: 18511 / 39409 / 0 / 0 [kilobytes/packets/eagains/errors]
Rx Offload: 1045354 / 2210387 / 0 / 1 [kilobytes/packets/eagains/errors]
Rx byte: cur 0 / max 313 / dropped 0 / limit 0
Rx pkt : cur 0 / max 1 / dropped 0
TLS Offload: version 0303 / cipher 51 / TX On / RX On
TLS Tx Offload: 17394 / 39407 [kilobytes/records]
TLS Rx Offload: 982755 / 2210381 / 28 / 0 [kilobytes/records/encrypted/mixed]
TLS Rx Resyncs: 1 [total]
======================================================
RING_ETH=[0]
Tx Offload: 18519 / 39559 [kilobytes/packets]
Rx Offload: 5080 / 39419 [kilobytes/packets]
TLS TX Context Setups: 1
TLS RX Context Setups: 1
Interrupts: 39324 / 38656 [requests/received]
Moderation: 1024 / 1024 [frames/usec period]
======================================================
Description of the statistics counters:
TLS Offload (version) - 0303 for TLS1.2 and 0304 for TLS1.3.
TLS Offload (cipher) - 51 for AES128-GCM and 52 for AES256-GCM.
TLS Offload (TX|RX) - On|Off indicates whether TLS offload is enabled for the transmit (TX) and receive (RX) directions.
TLS Tx Offload (kilobytes) - number of offloaded kilobytes, excluding headers and other TLS record overhead.
TLS Tx Offload (records) - number of created and queued TLS records.
TLS Tx Resyncs - number of HW resynchronizations due to out-of-sequence send operations.
TLS Rx Offload (kilobytes) - number of bytes received as TLS payload.
TLS Rx Offload (records) - total number of TLS records received on the socket.
TLS Rx Offload (encrypted) - number of encrypted TLS records that were decrypted in SW by XLIO.
TLS Rx Offload (mixed) - number of partially decrypted TLS records handled by XLIO.
TLS Rx Resyncs - number of times the HW lost synchronization.
TLS TX Context Setups - cumulative counter of created TLS TX contexts, which equals the total number of sockets with configured TLS TX offload.
TLS RX Context Setups - cumulative counter of created TLS RX contexts, which equals the total number of sockets with configured TLS RX offload.
Note: TLS kernel counters do not increment when the application is offloaded with LIBXLIO.
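When reading these counters from a script or a monitoring agent, the version and cipher codes listed above can be translated into names. The helper below is a small illustrative sketch; the function names are hypothetical and the mapping covers only the values listed in this section.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Map the "version" field (printed in hex by xlio_stats) to a name. */
const char *tls_version_name(unsigned long version)
{
    switch (version) {
    case 0x0303: return "TLS1.2";
    case 0x0304: return "TLS1.3";
    default:     return "unknown";
    }
}

/* Map the "cipher" field to a name. */
const char *tls_cipher_name(unsigned cipher)
{
    switch (cipher) {
    case 51: return "AES128-GCM";
    case 52: return "AES256-GCM";
    default: return "unknown";
    }
}

/* Format "<version> / <cipher>" into out; returns the snprintf() result. */
int describe_tls_offload(const char *version_hex, unsigned cipher,
                         char *out, size_t out_len)
{
    assert(version_hex != NULL && out != NULL);
    unsigned long version = strtoul(version_hex, NULL, 16);
    return snprintf(out, out_len, "%s / %s",
                    tls_version_name(version), tls_cipher_name(cipher));
}
```

For the sample output shown earlier (version 0303, cipher 51), describe_tls_offload("0303", 51, buf, sizeof(buf)) fills buf with "TLS1.2 / AES128-GCM".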
Supported Ciphers
The below table lists all the supported offloaded ciphers.
TLS Version | Bits | Hardware Offload | OpenSSL Name | XLIO TX Support | XLIO RX Support
1.2 | 128 | TLS1.2-AES128-GCM | AES128-GCM-SHA256 | YES | YES
1.2 | 128 | TLS1.2-AES128-GCM | ECDHE-ECDSA-AES128-GCM-SHA256 | YES | YES
1.2 | 128 | TLS1.2-AES128-GCM | ECDHE-RSA-AES128-GCM-SHA256 | YES | YES
1.2 | 256 | TLS1.2-AES256-GCM | AES256-GCM-SHA384 | YES (1) | YES (1)
1.2 | 256 | TLS1.2-AES256-GCM | ECDHE-ECDSA-AES256-GCM-SHA384 | YES (1) | YES (1)
1.2 | 256 | TLS1.2-AES256-GCM | ECDHE-RSA-AES256-GCM-SHA384 | YES (1) | YES (1)
1.3 | 128 | TLS1.3-AES128-GCM | TLS_AES_128_GCM_SHA256 | YES (1) | YES (1)
1.3 | 256 | TLS1.3-AES256-GCM | TLS_AES_256_GCM_SHA384 | YES (1) | YES (1)
(1) Not supported by RHEL v8.3 yet.
XLIO supports hardware timestamping for UDP-RX flow (only) with Precision Time Protocol (PTP).
When using XLIO on a server running a PTP daemon, XLIO can periodically query the kernel to obtain updated time conversion parameters which it uses in conjunction with the hardware time-stamp it receives from the NIC to provide synchronized time.
Prerequisites
Supported devices: NIC clock
Set XLIO_HW_TS_CONVERSION environment variable to 4
Usage
Set the SO_TIMESTAMPING option for the socket with value SOF_TIMESTAMPING_RX_HARDWARE:
uint8_t val = SOF_TIMESTAMPING_RX_HARDWARE;
setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
Set XLIO environment parameter XLIO_HW_TS_CONVERSION to 4.
Example:
Use the Linux kernel (v4.11) timestamping example found in the kernel source at: tools/testing/selftests/networking/timestamping/timestamping.c.
Server:
$ sudo LD_PRELOAD=libxlio.so XLIO_HW_TS_CONVERSION=4 ./timestamping <iface> SOF_TIMESTAMPING_RAW_HARDWARE SOF_TIMESTAMPING_RX_HARDWARE
Client:
$ LD_PRELOAD=libxlio.so sockperf tp -i <server-ip> -t 3600 -p 6666 --mps 10
timestamping output:
SOL_SOCKET SO_TIMESTAMPING SW 0.000000000 HW raw 1497823023.070846953
IP_PKTINFO interface index 8
SOL_SOCKET SO_TIMESTAMPING SW 0.000000000 HW raw 1497823023.170847260
IP_PKTINFO interface index 8
SOL_SOCKET SO_TIMESTAMPING SW 0.000000000 HW raw 1497823023.270847093
IP_PKTINFO interface index 8
Each PCI transaction between the system's RAM and the NIC takes at least ~300 nsec, and the cost grows with the buffer size. Application egress latency can be improved by reducing the number of PCI transactions on the send path as much as possible.
Today, XLIO achieves this by copying the WQE into the doorbell. For small packets (<190 bytes of payload), XLIO can also inline the packet into the WQE, which removes the data gather PCI transaction as well. For data sizes above 190 bytes, the NIC requires an additional PCI gather cycle to pull the data buffer for egress.
XLIO uses the on-device-memory to store the egress packet if it does not fit into the BF inline buffer. The on-device-memory is a resource managed by XLIO and it is transparent to the user. The total size of the on-device-memory is limited to 256k for a single-port NIC and to 128k for dual-port NIC. Using XLIO_RING_DEV_MEM_TX, the user can set the amount of on-device-memory buffer allocated for each TX ring.
Prerequisites
NIC: NVIDIA ConnectX®-6 Dx/ConnectX-7/BlueField-2/BlueField-3 and above
Protocol: Ethernet
Set XLIO_RING_DEV_MEM_TX environment variable to best suit the application's requirements
Verifying On-Device Memory Capability in the Hardware
To verify “On Device Memory” capability in the hardware, run XLIO with DEBUG trace level:
XLIO_TRACELEVEL=DEBUG LD_PRELOAD=<path to libxlio.so> <command line>
Look in the printout for a positive value of on-device-memory bytes.
For example:
Pid: 1748924 Tid: 1748924 XLIO DEBUG : ibch[0x5633333c62f0]:229:print_val() mlx5_2: port(s): 1 vendor: 4125 fw: 22.31.1034 max_qp_wr: 32768 on_device_memory: 131072 packet_pacing_caps: min rate 1, max rate 100000000
Pid: 1748924 Tid: 1748924 XLIO DEBUG : ibch[0x56333340fa60]:229:print_val() mlx5_3: port(s): 1 vendor: 4125 fw: 22.31.1034 max_qp_wr: 32768 on_device_memory: 131072 packet_pacing_caps: min rate 1, max rate 100000000
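If this check needs to be automated, for example in a deployment script that captures the debug log, the on_device_memory field can be pulled out of each log line. The helper below is an illustrative sketch; the function name is hypothetical, and it assumes the field is printed as "on_device_memory: <bytes>" as in the example output above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the on_device_memory value (in bytes) found in one XLIO debug
 * line, or -1 if the line does not carry a parsable field. */
long parse_on_device_memory(const char *line)
{
    static const char key[] = "on_device_memory:";

    assert(line != NULL);

    const char *p = strstr(line, key);
    if (p == NULL)
        return -1;
    p += sizeof(key) - 1; /* skip the key itself */

    char *end;
    long value = strtol(p, &end, 10); /* strtol() skips leading spaces */
    return end == p ? -1 : value;
}
```

A positive return value means the device reports on-device memory; 0 or -1 means the capability was not found in that line.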
To show and monitor On-Device Memory statistics, run xlio_stats tool.
xlio_stats -p <pid> -v 3
For example:
======================================================
RING_ETH=[0]
Tx Offload: 858931 / 3402875 [kilobytes/packets]
Rx Offload: 865251 / 3402874 [kilobytes/packets]
Dev Mem Alloc: 16384
Dev Mem Stats: 739074 / 1784935 / 0 [kilobytes/packets/oob]
======================================================
To enable the TCP_QUICKACK threshold, modify the TCP_QUICKACK_THRESHOLD parameter in the lwip/opt.h file and recompile XLIO.
While the TCP_QUICKACK option is enabled, TCP acknowledgments are sent immediately rather than being delayed as in normal TCP receive operation. However, processing of the incoming packet is deferred until the acknowledgment has been sent, which can affect performance.
The TCP_QUICKACK threshold lets the user disable quick acknowledgment for payloads that are larger than the threshold. The threshold is effective only when TCP_QUICKACK is enabled, either with setsockopt() or with the XLIO_TCP_QUICKACK parameter. The TCP_QUICKACK threshold is disabled by default.
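The interaction between the socket option and the threshold can be sketched as follows. enable_quickack() uses the standard Linux TCP_QUICKACK socket option. ack_immediately() is only a model of the behavior described above, not XLIO's implementation; the struct, the function names, and the convention that a threshold of 0 means "threshold disabled" are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Enable TCP_QUICKACK on a TCP socket. */
int enable_quickack(int fd)
{
    int on = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &on, sizeof(on));
}

/* Hypothetical model of the quick-ack settings. */
struct quickack_cfg {
    int enabled;      /* TCP_QUICKACK set by setsockopt() or XLIO_TCP_QUICKACK */
    size_t threshold; /* assumed: 0 means the threshold is disabled */
};

/* Return 1 if a segment with payload_len bytes is acknowledged
 * immediately under this model, 0 if the acknowledgment is delayed. */
int ack_immediately(const struct quickack_cfg *cfg, size_t payload_len)
{
    assert(cfg != NULL);

    if (!cfg->enabled)
        return 0; /* the threshold has no effect without TCP_QUICKACK */
    if (cfg->threshold == 0)
        return 1; /* no threshold: every segment is acknowledged at once */
    return payload_len <= cfg->threshold;
}
```

With a threshold of 1000 in this model, a 1000-byte payload is acknowledged immediately while a 1001-byte payload falls back to delayed acknowledgment.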
The XLIO daemon is responsible for managing the traffic control logic of all XLIO processes, including qdisc, u32 table hashing, adding filters, removing filters, and removing filters when an application crashes.
For XLIO daemon usage instructions, refer to the Installing the XLIO Binary Package section in the Installation Guide.
To show and monitor TAP statistics, run the xlio_stats tool:
xlio_stats -p <pid> -v 3
Example:
======================================================
RING_TAP=[0]
Master: 0x29e4260
Tx Offload: 4463 / 67209 [kilobytes/packets]
Rx Offload: 5977 / 90013 [kilobytes/packets]
Rx Buffers: 256
VF Plugouts: 1
Tap fd: 21
Tap Device: td34f15
======================================================
RING_ETH=[1]
Master: 0x29e4260
Tx Offload: 7527 / 113349 [kilobytes/packets]
Rx Offload: 7527 / 113349 [kilobytes/packets]
Retransmissions: 1
======================================================
Output analysis:
RING_TAP[0] and RING_ETH[1] have the same master 0x29e4260 ring
4463 Kbytes/67209 packets were sent from the TAP device
5977 Kbytes/90013 packets were received from the TAP device
Plugout event occurred once
TAP device fd number was 21, TAP name was td34f15
To get protection domain (PD) information from the ring used by the current socket, pass a struct xlio_pd_attr as the argument to getsockopt() with the SO_XLIO_PD option. The information is available after the connection is set up for a TX ring, and after the socket is bound to a device for an RX ring. By default, the PD of the TX ring is returned. The PD can be used with sendmsg(SCM_XLIO_PD), where the data portion of the control message contains an array of struct xlio_pd_key elements. The number of elements in this array must equal the msg_iovlen value, so that every data pointer in msg_iov has a corresponding memory key.
struct xlio_pd_attr {
    uint32_t flags;
    void *ib_pd;
};
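A query for the protection domain might look like the sketch below. Several things here are assumptions rather than confirmed API details: the option is queried at the SOL_SOCKET level, flags set to 0 selects the default (TX ring) behavior, and the numeric value of SO_XLIO_PD is passed in by the caller (taken from the XLIO extra API header) instead of being hard-coded. The structure is repeated locally only to keep the sketch self-contained; real code should use the definition from the XLIO header. The function name query_ring_pd is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Local copy of the structure shown above, for illustration only. */
struct xlio_pd_attr {
    uint32_t flags;
    void *ib_pd;
};

/* Query the protection domain of the ring behind fd.
 * so_xlio_pd: value of SO_XLIO_PD from the XLIO header (caller-supplied).
 * Returns 0 and fills attr on success, -1 on failure. */
int query_ring_pd(int fd, int so_xlio_pd, struct xlio_pd_attr *attr)
{
    assert(attr != NULL);

    socklen_t len = sizeof(*attr);
    memset(attr, 0, sizeof(*attr)); /* flags = 0: assumed default, TX ring */

    if (getsockopt(fd, SOL_SOCKET, so_xlio_pd, attr, &len) < 0)
        return -1;

    /* Treat a short reply or a NULL PD as "not available yet". */
    return (len == sizeof(*attr) && attr->ib_pd != NULL) ? 0 : -1;
}
```

For a TX ring, the call is expected to succeed only after the connection has been set up, so a failure before connect() is not necessarily an error in the application.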
NVME over TCP (NVMEoTCP) hardware offload feature accelerates NVMEoTCP DIGEST calculation for transmitted NVME PDUs.
Prerequisites
Please refer to System Requirements
The card must support NVMEoTCP offload (ConnectX-7, Bluefield-3)
Usage
From the application, call:
setsockopt(fd, IPPROTO_TCP, TCP_ULP, "nvme", 4)
Call:
setsockopt(fd, NVDA_NVME, NVME_TX, &configure, sizeof(configure))
where: uint32_t configure = XLIO_NVME_HDGST_ENABLE | XLIO_NVME_DDGST_ENABLE | XLIO_NVME_DDGST_OFFLOAD
Note: If any of the setsockopt calls fail, offload is not supported.
Call:
setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &opt_val, sizeof(opt_val))
where: int opt_val = 1
From the application, register the TX buffers with the PD obtained from XLIO (see the Getting Ring Protection Domain section above).
Call:
sendmsg(fd, msg, MSG_ZEROCOPY)
with the extended zero-copy API, where msg is a pointer to struct msghdr, with:
struct cmsghdr *cmsg = CMSG_FIRSTHDR(msg);
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_XLIO_NVME_PD;
cmsg->cmsg_len = msg->msg_controllen;
and with CMSG_DATA(cmsg) set to xlio_pd_key.
See Kernel TX Zero Copy documentation: https://www.kernel.org/doc/html/v4.15/networking/msg_zerocopy.html
Full examples can be found in XLIO GIT repository: https://github.com/Mellanox/libxlio
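The zero-copy part of the sequence above (the SO_ZEROCOPY step and the control message setup) can be sketched with standard socket calls only. This is an illustrative sketch, not the code from the repository: the function names are hypothetical, the control message type (SCM_XLIO_NVME_PD in the steps above) is passed in by the caller because its value comes from the XLIO header, and the key array is treated as opaque bytes because the layout of xlio_pd_key is not described here.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>

/* Fallback for older libc headers (value used by most architectures). */
#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif

/* Enable zero-copy transmission on the socket. */
int enable_zerocopy(int fd)
{
    int opt_val = 1;
    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &opt_val, sizeof(opt_val));
}

/* Attach one SOL_SOCKET control message carrying the memory key array
 * to msg. ctrl_buf must be suitably aligned for struct cmsghdr and at
 * least CMSG_SPACE(keys_len) bytes long. Returns 0 on success, -1 if the
 * buffer is too small. */
int attach_pd_keys(struct msghdr *msg, void *ctrl_buf, size_t ctrl_buf_len,
                   int cmsg_type, const void *keys, size_t keys_len)
{
    assert(msg != NULL && ctrl_buf != NULL && keys != NULL);

    if (ctrl_buf_len < CMSG_SPACE(keys_len))
        return -1;

    memset(ctrl_buf, 0, CMSG_SPACE(keys_len));
    msg->msg_control = ctrl_buf;
    msg->msg_controllen = CMSG_LEN(keys_len);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = cmsg_type;
    cmsg->cmsg_len = msg->msg_controllen;
    memcpy(CMSG_DATA(cmsg), keys, keys_len);
    return 0;
}
```

After attach_pd_keys() returns 0 and msg_iov/msg_iovlen are filled with the registered TX buffers, the message is ready to be passed to sendmsg(fd, msg, MSG_ZEROCOPY).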
Offloaded TCP sockets support SO_XLIO_ISOLATE option on SOL_SOCKET level. The option allows grouping sockets with specific policy. Value for the option has type ‘int’ and contains the policy.
Supported policies:
SO_XLIO_ISOLATE_DEFAULT – default behavior according to XLIO configuration.
SO_XLIO_ISOLATE_SAFE – isolate sockets from the default sockets and guarantee thread safety regardless of XLIO configuration. This policy is effective in XLIO_TCP_CTL_THREAD=delegate configuration. Socket API thread safety model is not changed.
Limitations:
The SO_XLIO_ISOLATE option may be set only after the socket() syscall and before either listen() or connect().
Accelerate and enhance the performance of your NGINX webserver using NVIDIA Accelerated IO (XLIO).
XLIO optimizes data transfers and significantly reduces latency by leveraging advanced hardware acceleration capabilities.
Prerequisites
For kTLS usage, see "Advanced Features" → "TLS HW Offload" → "Supported Ciphers".
Limitations
XLIO does not support running in daemon mode. Ensure daemon off; remains set.
Nginx best practices
For best practices guidelines, see Appendix: NGINX
Usage
NGINX Configuration
Ensure these settings in your global configuration block (nginx.conf):
worker_processes <NUM-WORKERS>; # this directive needs to be coherent with XLIO_NGINX_WORKERS_NUM
daemon off; # XLIO doesn't support daemon on. Keep it off.
XLIO configuration
XLIO_NGINX_WORKERS_NUM=<NUM-WORKERS> should be coherent with worker_processes in nginx.conf.
XLIO_SPEC=<SPEC>
For x86 platforms - XLIO_SPEC=nginx
For BlueField DPU - XLIO_SPEC=nginx_dpu