Direct Packet Control Plane (DPCP)
DPCP provides a unified, flexible interface for programming NVIDIA NICs. DPCP is a prerequisite for enabling XLIO features such as LRO offload, Striding-RQ and TLS HW offload.
XLIO, which comes as part of OFED, uses DPCP. Check the Important Notes under Changes and New Features for the minimum DPCP version required.
DPCP is an open source project - see its repository here.
TLS HW Offload
The TLS HW offload feature accelerates TLS encryption/decryption.
- Please refer to System Requirements
- The card must be crypto enabled based on supported cards
- Linux distribution with kTLS support
- Application or TLS library with kTLS support
- OpenSSL library for symmetric encryption SW fallback.
XLIO offloads the Linux kTLS API. Refer to the Linux documentation for a description of the kTLS API: https://www.kernel.org/doc/html/latest/networking/tls.html.
TLS HW offload can be provided to an application implicitly by a TLS library with kTLS support, such as OpenSSL.
If TLS HW offload cannot be provided, the setsockopt() syscall returns an error with errno=ENOPROTOOPT. TLS HW offload can be forcibly disabled per direction using the configuration options XLIO_UTLS_TX and XLIO_UTLS_RX.
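For applications that configure kTLS directly rather than through a TLS library, the following is a minimal sketch of the standard Linux kTLS TX setup for TLS 1.2 with AES128-GCM, as described in the kernel documentation referenced above. The helper name enable_ktls_tx and the struct ktls_tx_keys are hypothetical and exist only for this illustration; the key material is assumed to come from a TLS handshake the application has already completed.

```c
#include <assert.h>
#include <errno.h>
#include <linux/tls.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_TLS
#define SOL_TLS 282 /* fallback for older libc headers */
#endif
#ifndef TCP_ULP
#define TCP_ULP 31  /* fallback for older libc headers */
#endif

/* Hypothetical container for key material produced by the TLS handshake. */
struct ktls_tx_keys {
    const unsigned char *key;     /* TLS_CIPHER_AES_GCM_128_KEY_SIZE bytes */
    const unsigned char *salt;    /* TLS_CIPHER_AES_GCM_128_SALT_SIZE bytes */
    const unsigned char *iv;      /* TLS_CIPHER_AES_GCM_128_IV_SIZE bytes */
    const unsigned char *rec_seq; /* TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE bytes */
};

/* Returns 0 on success or a negative errno value on failure. */
static int enable_ktls_tx(int fd, const struct ktls_tx_keys *k)
{
    struct tls12_crypto_info_aes_gcm_128 ci;

    assert(k && k->key && k->salt && k->iv && k->rec_seq);

    memset(&ci, 0, sizeof(ci));
    ci.info.version = TLS_1_2_VERSION;
    ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
    memcpy(ci.key, k->key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
    memcpy(ci.salt, k->salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
    memcpy(ci.iv, k->iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
    memcpy(ci.rec_seq, k->rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

    /* Step 1: attach the "tls" upper layer protocol to the TCP socket. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
        return -errno;

    /* Step 2: hand the TX crypto state to the kTLS layer. */
    if (setsockopt(fd, SOL_TLS, TLS_TX, &ci, sizeof(ci)) < 0)
        return -errno;

    return 0;
}
```

A caller would typically treat a return value of -ENOPROTOOPT as "offload not available" and continue with software TLS in user space.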
The TLS HW offload feature adds new statistics counters. Their presence indicates that offload is configured and working. The xlio_stats tool with option -v3 shows TLS statistics for TCP sockets and rings:
Description of the statistics counters:
- TLS Offload (version) - 0303 for TLS1.2 and 0304 for TLS1.3.
- TLS Offload (cipher) - 51 for AES128-GCM and 52 for AES256-GCM.
- TLS Offload (TX|RX) - On|Off values indicate whether TLS offload is enabled for the transmit (TX) and receive (RX) directions.
- TLS Tx Offload (kilobytes) - number of offloaded kilobytes, excluding headers and other TLS record overhead.
- TLS Tx Offload (records) - number of created and queued TLS records.
- TLS Tx Resyncs - number of HW resynchronizations due to out-of-sequence send operations.
- TLS Rx Offload (kilobytes) - number of kilobytes received as TLS payload.
- TLS Rx Offload (records) - total number of TLS records received on the socket.
- TLS Rx Offload (encrypted) - number of encrypted TLS records that were decrypted in SW by XLIO.
- TLS Rx Offload (mixed) - number of partially decrypted TLS records handled by XLIO.
- TLS Rx Resyncs - number of times the HW lost synchronization.
- TLS TX Context Setups - cumulative counter of created TLS TX contexts, which equals the total number of sockets configured with TLS TX offload.
- TLS RX Context Setups - cumulative counter of created TLS RX contexts, which equals the total number of sockets configured with TLS RX offload.
Tuning of XLIO for TLS HW Offload in DNS-over-HTTPS (DoH) scenario
For the DNS-over-HTTPS (DoH) scenario, there are specific profiles that are optimized for the NGINX frontend side. For x86 servers, we recommend using XLIO_SPEC=nginx. For NVIDIA DPU systems, we recommend using XLIO_SPEC=nginx_dpu.
Tuning of XLIO for TLS HW Offload in Content Delivery Network (CDN) scenario
The basic profile for Content Delivery Network (CDN) scenario is XLIO_SPEC=nginx.
In the CDN scenario, the TLS payload often exceeds the MTU size. In this case, it is recommended to increase the TX buffer size. With larger TX buffers, XLIO can create more optimal TLS records.
However, this change may require increasing the number of hugepages configured in the system.
The below table lists all the supported offloaded ciphers.
| TLS Version | Bits | Hardware Offload | OpenSSL Name | XLIO Support |
- Not supported by RHEL v8.3 yet.
Precision Time Protocol (PTP)
XLIO supports hardware timestamping for the UDP RX flow only, with Precision Time Protocol (PTP).
When using XLIO on a server running a PTP daemon, XLIO can periodically query the kernel to obtain updated time conversion parameters, which it uses together with the hardware timestamp it receives from the NIC to provide synchronized time.
- Supported devices: NIC clock
- Set XLIO_HW_TS_CONVERSION environment variable to 4
Set the SO_TIMESTAMPING option for the socket with value SOF_TIMESTAMPING_RX_HARDWARE:
Set XLIO environment parameter XLIO_HW_TS_CONVERSION to 4.
Use the Linux kernel (v4.11) timestamping example found in the kernel source at: tools/testing/selftests/networking/timestamping/timestamping.c.
Each PCI transaction between the system's RAM and the NIC takes at least ~300 nsec (increasing with buffer size). Application egress latency can be improved by eliminating as many PCI transactions as possible on the send path.
Today, XLIO achieves this by copying the WQE into the doorbell. For small packets (<190 bytes of payload), XLIO can also inline the packet into the WQE, which eliminates the data-gather PCI transaction. For data sizes above 190 bytes, the NIC requires an additional PCI gather cycle to pull the data buffer for egress.
XLIO uses on-device memory to store the egress packet if it does not fit into the BF inline buffer. The on-device memory is a resource managed by XLIO and is transparent to the user. The total size of the on-device memory is limited to 256k for a single-port NIC and to 128k for a dual-port NIC. Using XLIO_RING_DEV_MEM_TX, the user can set the amount of on-device memory allocated for each TX ring.
- Driver: MLNX_OFED version 23.07 and above
- NIC: NVIDIA ConnectX®-6 Dx/ConnectX-7/BlueField-2/BlueField-3 and above
- Protocol: Ethernet.
- Set XLIO_RING_DEV_MEM_TX environment variable to best suit the application's requirements
Verifying On-Device Memory Capability in the Hardware
To verify “On Device Memory” capability in the hardware, run XLIO with DEBUG trace level:
Look in the printout for a positive value of on-device-memory bytes.
To show and monitor On-Device Memory statistics, run xlio_stats tool.
To enable the TCP_QUICKACK threshold, modify the TCP_QUICKACK_THRESHOLD parameter in the lwip/opt.h file and recompile XLIO.
While the TCP_QUICKACK option is enabled, TCP acknowledgments are sent immediately rather than being delayed as in normal TCP receive operation. However, sending the acknowledgment postpones processing of the incoming packet until the acknowledgment has been sent, which can affect performance.
The TCP_QUICKACK threshold lets the user disable quick acknowledgment for payloads that are larger than the threshold. The threshold is effective only when TCP_QUICKACK is enabled, either with setsockopt() or with the XLIO_TCP_QUICKACK parameter. The TCP_QUICKACK threshold is disabled by default.
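The per-socket part of this can be sketched with the standard Linux TCP_QUICKACK socket option. The second helper, should_quickack, is a purely hypothetical illustration of the threshold rule described above and is not XLIO code; it assumes that a threshold of 0 stands for "threshold disabled".

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stddef.h>
#include <sys/socket.h>

/* Enable (1) or disable (0) TCP_QUICKACK on a TCP socket.
 * Returns 0 on success or a negative errno value on failure. */
static int set_quickack(int fd, int enable)
{
    if (setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &enable, sizeof(enable)) < 0)
        return -errno;
    return 0;
}

/* HYPOTHETICAL model of the threshold rule, for illustration only.
 * Assumption: threshold == 0 means the threshold is disabled. */
static int should_quickack(int quickack_enabled, size_t payload_len,
                           size_t threshold)
{
    if (!quickack_enabled)
        return 0; /* the threshold only matters when TCP_QUICKACK is on */
    if (threshold == 0)
        return 1; /* no threshold: every segment is acknowledged at once */
    return payload_len <= threshold; /* larger payloads fall back to delayed ACK */
}
```

With a threshold of 64, for example, this model acknowledges a 64-byte payload immediately but not a 100-byte payload.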
XLIO Daemon Design
The XLIO daemon is responsible for managing the traffic control logic of all XLIO processes, including qdisc, u32 table hashing, adding and removing filters, and removing filters when an application crashes.
For XLIO daemon usage instructions, refer to the Installing the XLIO Binary Package section in the Installation Guide.
To show and monitor TAP statistics, run the xlio_stats tool:
- RING_TAP and RING_ETH have the same master 0x29e4260 ring
- 4463 Kbytes/67209 packets were sent from the TAP device
- 5977 Kbytes/90013 packets were received from the TAP device
- Plugout event occurred once
- TAP device fd number was 21, TAP name was td34f15
Getting Ring Protection Domain
Pass a special structure as an argument to getsockopt() with SO_XLIO_PD to get protection domain (PD) information from the ring used by the current socket. This information is available after the connection is established for a TX ring, and after binding to a device for an RX ring. By default, the PD of the TX ring is returned. This can be used with sendmsg(SCM_XLIO_PD), where the data portion contains an array of elements of type struct xlio_pd_key. The number of elements in this array must equal the msg_iovlen value, so that every data pointer in msg_iov has a corresponding memory key.
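The one-key-per-iovec rule can be sketched as follows. This is only an illustration of the ancillary-data layout: struct demo_pd_key and DEMO_SCM_PD are hypothetical stand-ins for struct xlio_pd_key and SCM_XLIO_PD, whose real definitions come from the XLIO headers, and the field layout, the constant value, and the use of SOL_SOCKET as the cmsg level are assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* HYPOTHETICAL stand-ins, not the real XLIO definitions. */
struct demo_pd_key {
    uint32_t flags;
    uint32_t mkey; /* memory key for the matching msg_iov entry */
};
#define DEMO_SCM_PD 0x7fff /* placeholder, NOT the real SCM_XLIO_PD value */

/* Attach one key per msg_iov entry as a single control message.
 * ctrl must hold at least CMSG_SPACE(nkeys * sizeof(struct demo_pd_key)).
 * Returns 0 on success, -1 on a count mismatch or a too-small buffer. */
static int attach_pd_keys(struct msghdr *msg, void *ctrl, size_t ctrl_len,
                          const struct demo_pd_key *keys, size_t nkeys)
{
    size_t payload = nkeys * sizeof(*keys);
    struct cmsghdr *cmsg;

    /* The array length must match the number of data pointers. */
    if (nkeys != (size_t)msg->msg_iovlen || ctrl_len < CMSG_SPACE(payload))
        return -1;

    memset(ctrl, 0, ctrl_len);
    msg->msg_control = ctrl;
    msg->msg_controllen = CMSG_SPACE(payload);

    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = SOL_SOCKET; /* assumption */
    cmsg->cmsg_type = DEMO_SCM_PD;
    cmsg->cmsg_len = CMSG_LEN(payload);
    memcpy(CMSG_DATA(cmsg), keys, payload);
    return 0;
}
```

After a successful call, the msghdr is ready to be passed to sendmsg(); in real code the keys would come from registering each buffer with the PD returned by getsockopt().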
NVME over TCP DIGEST Offload Tx (Alpha Level)
The NVME over TCP (NVMEoTCP) hardware offload feature accelerates NVMEoTCP DIGEST calculation for transmitted NVME PDUs.
- Please refer to System Requirements
- The card must support NVMEoTCP offload (ConnectX-7, BlueField-3)
- Use the application to call
setsockopt(fd, IPPROTO_TCP, TCP_ULP, "nvme", 4)
setsockopt(fd, NVDA_NVME, NVME_TX, &configure, sizeof(configure))
where: uint32_t configure = XLIO_NVME_HDGST_ENABLE | XLIO_NVME_DDGST_ENABLE | XLIO_NVME_DDGST_OFFLOAD
Note: If any of the setsockopt calls fail, offload is not supported.
setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &opt_val, sizeof(opt_val))
where: int opt_val = 1
- Use the application to register TX buffers with the PD from XLIO (see the Getting Ring Protection Domain section above)
sendmsg(fd, msg, MSG_ZEROCOPY) with the extended zero-copy API
where msg is of type msghdr
- With cmsghdr *cmsg = CMSG_FIRSTHDR(msg);
- cmsg->cmsg_level = SOL_SOCKET;
- cmsg->cmsg_type = SCM_XLIO_NVME_PD;
- cmsg->cmsg_len = msg->msg_controllen;
- With CMSG_DATA(cmsg) set to xlio_pd_key
See Kernel TX Zero Copy documentation: https://www.kernel.org/doc/html/v4.15/networking/msg_zerocopy.html
Full examples can be found in the XLIO Git repository: https://github.com/Mellanox/libxlio
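Independently of the NVMe-specific options, the generic zero-copy part of the sequence can be sketched with the standard Linux API from the kernel documentation linked above. The helper names are hypothetical, and the fallback constants apply only when the libc headers do not define them.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60 /* fallback for older libc headers (generic value) */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000 /* fallback for older libc headers */
#endif

/* Opt the socket in to zero-copy transmission.
 * Returns 0 on success or a negative errno value on failure. */
static int enable_zerocopy(int fd)
{
    int opt_val = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &opt_val, sizeof(opt_val)) < 0)
        return -errno;
    return 0;
}

/* Send a prepared msghdr with MSG_ZEROCOPY.
 * Returns the number of bytes queued or a negative errno value. */
static ssize_t send_zerocopy(int fd, const struct msghdr *msg)
{
    ssize_t n;

    assert(msg != NULL);
    n = sendmsg(fd, msg, MSG_ZEROCOPY);
    return n < 0 ? -errno : n;
}
```

As the kernel documentation explains, the buffers referenced by msg_iov must not be modified until the completion notification for the send has been read from the socket error queue.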