Advanced Features
DPCP provides a unified, flexible interface for programming NVIDIA NICs. DPCP is a prerequisite for enabling XLIO features such as LRO offload, Striding-RQ and TLS HW offload.
XLIO, which comes as part of OFED, uses DPCP. Check the Important Notes under Changes and New Features for the minimum DPCP version required.
DPCP is an open source project - see its repository here.
TLS HW offload feature accelerates TLS encryption/decryption.
Prerequisites
Please refer to System Requirements
The card must be crypto-enabled (see the list of supported cards)
Linux distribution with kTLS support
Application or TLS library with kTLS support
OpenSSL library for symmetric encryption SW fallback.
Usage
XLIO offloads the Linux kTLS API. Refer to the Linux documentation for a description of the kTLS API: https://www.kernel.org/doc/html/latest/networking/tls.html.
TLS HW offload can be provided to an application implicitly by a TLS library with kTLS support, such as OpenSSL.
If TLS HW offload cannot be provided, the setsockopt() syscall returns an error with errno=ENOPROTOOPT. TLS HW offload can be forcibly disabled per direction using the configuration options XLIO_UTLS_TX and XLIO_UTLS_RX.
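The following is a minimal sketch of how an application might configure kTLS transmit offload directly through the standard Linux kTLS API and detect the ENOPROTOOPT case described above. The function names fill_tls12_aes128 and enable_tls_tx are hypothetical, the key material is assumed to come from a TLS handshake that is not shown, and the fallback policy is an illustration only.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/tls.h>

/* Fallbacks for older libc headers; the values are the Linux UAPI constants. */
#ifndef TCP_ULP
#define TCP_ULP 31
#endif
#ifndef SOL_TLS
#define SOL_TLS 282
#endif

/* Fill kTLS parameters for TLS 1.2 with AES128-GCM.
 * key, iv, salt and rec_seq come from the TLS handshake (not shown). */
void fill_tls12_aes128(struct tls12_crypto_info_aes_gcm_128 *ci,
                       const unsigned char *key, const unsigned char *iv,
                       const unsigned char *salt, const unsigned char *rec_seq)
{
    memset(ci, 0, sizeof(*ci));
    ci->info.version = TLS_1_2_VERSION;            /* 0x0303 */
    ci->info.cipher_type = TLS_CIPHER_AES_GCM_128; /* 51 */
    memcpy(ci->key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
    memcpy(ci->iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
    memcpy(ci->salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
    memcpy(ci->rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
}

/* Returns 0 if TX offload was configured, 1 if the caller should fall back
 * to software TLS (errno == ENOPROTOOPT), or -1 on any other error. */
int enable_tls_tx(int fd, const struct tls12_crypto_info_aes_gcm_128 *ci)
{
    if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) == 0 &&
        setsockopt(fd, SOL_TLS, TLS_TX, ci, sizeof(*ci)) == 0)
        return 0;
    return errno == ENOPROTOOPT ? 1 : -1;
}
```

A TLS library with kTLS support performs the equivalent calls on the application's behalf, so most applications never call these functions themselves.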
The TLS HW offload feature adds new statistics counters. Their presence indicates that offload is configured and working. The xlio_stats tool with option -v3 shows TLS statistics for TCP sockets and rings:
======================================================
Fd=[59]
- TCP, Non-blocked
- Local Address = [14.212.1.34:443]
- Foreign Address = [14.212.1.57:49072]
Tx Offload: 18511 / 39409 / 0 / 0 [kilobytes/packets/eagains/errors]
Rx Offload: 1045354 / 2210387 / 0 / 1 [kilobytes/packets/eagains/errors]
Rx byte: cur 0 / max 313 / dropped 0 / limit 0
Rx pkt : cur 0 / max 1 / dropped 0
TLS Offload: version 0303 / cipher 51 / TX On / RX On
TLS Tx Offload: 17394 / 39407 [kilobytes/records]
TLS Rx Offload: 982755 / 2210381 / 28 / 0 [kilobytes/records/encrypted/mixed]
TLS Rx Resyncs: 1 [total]
======================================================
RING_ETH=[0]
Tx Offload: 18519 / 39559 [kilobytes/packets]
Rx Offload: 5080 / 39419 [kilobytes/packets]
TLS TX Context Setups: 1
TLS RX Context Setups: 1
Interrupts: 39324 / 38656 [requests/received]
Moderation: 1024 / 1024 [frames/usec period]
======================================================
Description of the statistics counters:
TLS Offload (version) - 0303 for TLS1.2 and 0304 for TLS1.3.
TLS Offload (cipher) - 51 for AES128-GCM and 52 for AES256-GCM.
TLS Offload (TX|RX) - On or Off, indicating whether TLS offload is active for the transmit (TX) and receive (RX) directions.
TLS Tx Offload (kilobytes) - number of offloaded kilobytes, excluding headers and other TLS record overhead.
TLS Tx Offload (records) - number of created and queued TLS records.
TLS Tx Resyncs - number of HW resynchronizations due to out-of-sequence send operations.
TLS Rx Offload (kilobytes) - number of kilobytes received as TLS payload.
TLS Rx Offload (records) - total number of TLS records received on the socket.
TLS Rx Offload (encrypted) - number of encrypted TLS records that were decrypted in SW by XLIO.
TLS Rx Offload (mixed) - number of partially decrypted TLS records handled by XLIO.
TLS Rx Resyncs - number of times the HW lost synchronization.
TLS TX Context Setups - cumulative counter of created TLS TX contexts, which equals the total number of sockets configured with TLS TX offload.
TLS RX Context Setups - cumulative counter of created TLS RX contexts, which equals the total number of sockets configured with TLS RX offload.
Tuning of XLIO for TLS HW Offload in DNS-over-HTTPS (DoH) scenario
For the DNS-over-HTTPS (DoH) scenario, there are specific profiles optimized for the NGINX frontend side. For an x86 server, we recommend XLIO_SPEC=nginx. For an NVIDIA DPU system, we recommend XLIO_SPEC=nginx_dpu.
Tuning of XLIO for TLS HW Offload in Content Delivery Network (CDN) scenario
The basic profile for the Content Delivery Network (CDN) scenario is XLIO_SPEC=nginx.
In the CDN scenario, the TLS payload often exceeds the MTU size. In this case, it is recommended to increase the TX buffer size. With larger TX buffers, XLIO can create more optimal TLS records.
XLIO_TX_BUF_SIZE=16384
However, this change may require increasing the number of hugepages configured in the system.
Supported Ciphers
The below table lists all the supported offloaded ciphers.
| TLS Version | Bits | Hardware Offload | OpenSSL Name | XLIO Support (TX) | XLIO Support (RX) |
|---|---|---|---|---|---|
| 1.2 | 128 | TLS1.2-AES128-GCM | AES128-GCM-SHA256 | YES | YES |
| 1.2 | 128 | TLS1.2-AES128-GCM | ECDHE-ECDSA-AES128-GCM-SHA256 | YES | YES |
| 1.2 | 128 | TLS1.2-AES128-GCM | ECDHE-RSA-AES128-GCM-SHA256 | YES | YES |
| 1.2 | 256 | TLS1.2-AES256-GCM | AES256-GCM-SHA384 | YES (1) | YES (1) |
| 1.2 | 256 | TLS1.2-AES256-GCM | ECDHE-ECDSA-AES256-GCM-SHA384 | YES (1) | YES (1) |
| 1.2 | 256 | TLS1.2-AES256-GCM | ECDHE-RSA-AES256-GCM-SHA384 | YES (1) | YES (1) |
| 1.3 | 128 | TLS1.3-AES128-GCM | TLS_AES_128_GCM_SHA256 | YES (1) | YES (1) |
| 1.3 | 256 | TLS1.3-AES256-GCM | TLS_AES_256_GCM_SHA384 | YES (1) | YES (1) |
(1) Not supported by RHEL v8.3 yet.
XLIO supports hardware timestamping with Precision Time Protocol (PTP) for the UDP RX flow only.
When using XLIO on a server running a PTP daemon, XLIO can periodically query the kernel to obtain updated time conversion parameters which it uses in conjunction with the hardware time-stamp it receives from the NIC to provide synchronized time.
Prerequisites
Supported devices: NIC clock
Set XLIO_HW_TS_CONVERSION environment variable to 4
Usage
Set the SO_TIMESTAMPING option for the socket with value SOF_TIMESTAMPING_RX_HARDWARE:
uint8_t val = SOF_TIMESTAMPING_RX_HARDWARE;
setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
Set XLIO environment parameter XLIO_HW_TS_CONVERSION to 4.
Example:
Use the Linux kernel (v4.11) timestamping example found in the kernel source at: tools/testing/selftests/networking/timestamping/timestamping.c.
Server
$ sudo LD_PRELOAD=libxlio.so XLIO_HW_TS_CONVERSION=4 ./timestamping <iface> SOF_TIMESTAMPING_RAW_HARDWARE SOF_TIMESTAMPING_RX_HARDWARE
Client
$ LD_PRELOAD=libxlio.so sockperf tp -i <server-ip> -t 3600 -p 6666 --mps 10
timestamping output:
SOL_SOCKET SO_TIMESTAMPING SW 0.000000000 HW raw 1497823023.070846953
IP_PKTINFO interface index 8
SOL_SOCKET SO_TIMESTAMPING SW 0.000000000 HW raw 1497823023.170847260
IP_PKTINFO interface index 8
SOL_SOCKET SO_TIMESTAMPING SW 0.000000000 HW raw 1497823023.270847093
IP_PKTINFO interface index 8
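An application can read the "HW raw" value itself from the SCM_TIMESTAMPING control message that accompanies a received datagram. The sketch below follows the standard Linux SO_TIMESTAMPING convention (three timespec values, with the raw hardware timestamp at index 2). The function name get_hw_rx_timestamp is hypothetical, and the msghdr is assumed to have been filled by a prior recvmsg() call with a control buffer attached.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

#ifndef SCM_TIMESTAMPING
#define SCM_TIMESTAMPING SO_TIMESTAMPING
#endif

/* Scan the control messages of a received msghdr for SCM_TIMESTAMPING.
 * The payload holds three timespec values; index 2 is the raw hardware
 * timestamp. Returns 1 and fills *hw if a non-zero one is found, else 0. */
int get_hw_rx_timestamp(struct msghdr *msg, struct timespec *hw)
{
    struct cmsghdr *c;

    for (c = CMSG_FIRSTHDR(msg); c != NULL; c = CMSG_NXTHDR(msg, c)) {
        struct timespec ts[3];

        if (c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_TIMESTAMPING)
            continue;
        if (c->cmsg_len < CMSG_LEN(sizeof(ts)))
            continue; /* too short to carry three timestamps */

        memcpy(ts, CMSG_DATA(c), sizeof(ts));
        if (ts[2].tv_sec != 0 || ts[2].tv_nsec != 0) {
            *hw = ts[2];
            return 1;
        }
    }
    return 0;
}
```

A return value of 0 means that no hardware timestamp was attached to the message.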
Each PCI transaction between the system’s RAM and the NIC takes at least ~300 nsec (more, depending on buffer size). Application egress latency can be improved by eliminating as many PCI transactions as possible on the send path.
Today, XLIO achieves this by copying the WQE into the doorbell. For small packets (<190 bytes of payload), XLIO can also inline the packet into the WQE, which removes the data-gather PCI transaction as well. For data sizes above 190 bytes, the NIC needs an additional PCI gather cycle to pull the data buffer for egress.
XLIO uses on-device memory to store the egress packet if it does not fit into the BF inline buffer. On-device memory is a resource managed by XLIO and is transparent to the user. Its total size is limited to 256 KB for a single-port NIC and to 128 KB for a dual-port NIC. Using XLIO_RING_DEV_MEM_TX, the user can set the amount of on-device memory allocated for each TX ring.
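The send-path choice described above can be summarized in a small sketch. This is a simplified model for illustration, not XLIO's implementation: the enum and function names are hypothetical, the 190-byte limit is taken from the text, and the rule that a packet goes to on-device memory whenever it fits in the ring's remaining allocation is an assumption.

```c
#include <stddef.h>

/* Hypothetical names for illustration; these are not XLIO identifiers. */
enum tx_path {
    TX_INLINE_WQE,    /* payload copied into the WQE, no data-gather read */
    TX_DEVICE_MEMORY, /* payload staged in on-device memory */
    TX_HOST_GATHER    /* NIC pulls the payload from host RAM */
};

#define INLINE_PAYLOAD_LIMIT 190 /* bytes, as stated in the text */

/* dev_mem_free: bytes still available in this ring's on-device memory
 * allocation (assumed to be bounded by XLIO_RING_DEV_MEM_TX). */
enum tx_path pick_tx_path(size_t payload_len, size_t dev_mem_free)
{
    if (payload_len < INLINE_PAYLOAD_LIMIT)
        return TX_INLINE_WQE;
    if (payload_len <= dev_mem_free)
        return TX_DEVICE_MEMORY;
    return TX_HOST_GATHER;
}
```

In this model, only the last path pays for the extra PCI gather cycle.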
Prerequisites
Driver: MLNX_OFED version 23.07 and above
NIC: NVIDIA ConnectX®-6 Dx/ConnectX-7/BlueField-2/BlueField-3 and above
Protocol: Ethernet.
Set XLIO_RING_DEV_MEM_TX environment variable to best suit the application's requirements
Verifying On-Device Memory Capability in the Hardware
To verify “On Device Memory” capability in the hardware, run XLIO with DEBUG trace level:
XLIO_TRACELEVEL=DEBUG LD_PRELOAD=<path to libxlio.so> <command line>
Look in the printout for a positive value of on-device-memory bytes.
For example:
Pid: 1748924 Tid: 1748924 XLIO DEBUG : ibch[0x5633333c62f0]:229:print_val() mlx5_2: port(s): 1 vendor: 4125 fw: 22.31.1034 max_qp_wr: 32768 on_device_memory: 131072 packet_pacing_caps: min rate 1, max rate 100000000
Pid: 1748924 Tid: 1748924 XLIO DEBUG : ibch[0x56333340fa60]:229:print_val() mlx5_3: port(s): 1 vendor: 4125 fw: 22.31.1034 max_qp_wr: 32768 on_device_memory: 131072 packet_pacing_caps: min rate 1, max rate 100000000
To show and monitor On-Device Memory statistics, run xlio_stats tool.
xlio_stats -p <pid> -v 3
For example:
======================================================
RING_ETH=[0]
Tx Offload: 858931 / 3402875 [kilobytes/packets]
Rx Offload: 865251 / 3402874 [kilobytes/packets]
Dev Mem Alloc: 16384
Dev Mem Stats: 739074 / 1784935 / 0 [kilobytes/packets/oob]
======================================================
To enable the TCP_QUICKACK threshold, modify the TCP_QUICKACK_THRESHOLD parameter in the lwip/opt.h file and recompile XLIO.
While the TCP_QUICKACK option is enabled, TCP acknowledgments are sent immediately rather than being delayed as in normal TCP receive operation. However, sending the acknowledgment postpones processing of the incoming packet until the acknowledgment has been sent, which can affect performance.
The TCP_QUICKACK threshold lets the user disable quick acknowledgment for payloads larger than the threshold. The threshold is effective only when TCP_QUICKACK is enabled, either with setsockopt() or with the XLIO_TCP_QUICKACK parameter. The TCP_QUICKACK threshold is disabled by default.
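As a minimal sketch, the per-socket option can be enabled with the standard TCP_QUICKACK socket option, and the threshold rule described above can be modeled as a simple predicate. The function names are hypothetical, and treating a threshold of 0 as "threshold disabled" is an assumption made for this illustration, not a statement about the value used in lwip/opt.h.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Enable TCP_QUICKACK on a TCP socket. Returns 0 on success, -1 on error. */
int enable_quickack(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
}

/* Model of the threshold rule: with quick ACK enabled, payloads larger
 * than the threshold fall back to delayed ACK.
 * threshold == 0 is treated here as "no threshold" (assumption). */
int should_quick_ack(int quickack_enabled, size_t threshold, size_t payload_len)
{
    if (!quickack_enabled)
        return 0;
    if (threshold == 0)
        return 1;
    return payload_len <= threshold;
}
```

With a threshold of 1000 bytes in this model, a 64-byte request is acknowledged immediately, while a 1400-byte segment is not.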
The XLIO daemon is responsible for managing all traffic control logic of all XLIO processes, including qdisc, u32 table hashing, adding filters, removing filters, and removing filters when the application crashes.
For XLIO daemon usage instructions, refer to the Installing the XLIO Binary Package section in the Installation Guide.
To show and monitor TAP statistics, run the xlio_stats tool:
xlio_stats -p <pid> -v 3
Example:
======================================================
RING_TAP=[0]
Master: 0x29e4260
Tx Offload: 4463 / 67209 [kilobytes/packets]
Rx Offload: 5977 / 90013 [kilobytes/packets]
Rx Buffers: 256
VF Plugouts: 1
Tap fd: 21
Tap Device: td34f15
======================================================
RING_ETH=[1]
Master: 0x29e4260
Tx Offload: 7527 / 113349 [kilobytes/packets]
Rx Offload: 7527 / 113349 [kilobytes/packets]
Retransmissions: 1
======================================================
Output analysis:
RING_TAP[0] and RING_ETH[1] have the same master 0x29e4260 ring
4463 Kbytes/67209 packets were sent from the TAP device
5977 Kbytes/90013 packets were received from the TAP device
Plugout event occurred once
TAP device fd number was 21, TAP name was td34f15
Pass a special structure as an argument to getsockopt() with SO_XLIO_PD to get the protection domain (PD) information from the ring used by the current socket. This information becomes available after the connection is set up for a TX ring, and after binding to a device for an RX ring. By default, the PD of the TX ring is returned. The result can be used with sendmsg(SCM_XLIO_PD), where the data portion contains an array of elements of type struct xlio_pd_key. The number of elements in this array must equal the msg_iovlen value, so that every data pointer in msg_iov has a corresponding memory key.
struct xlio_pd_attr {
    uint32_t flags;
    void *ib_pd;
};
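A hedged sketch of the query follows. To keep it compilable without the XLIO headers, the structure is repeated locally and the option number is passed in as a parameter; in a real build both would come from the XLIO header that defines SO_XLIO_PD. The function name query_ring_pd is hypothetical, and the use of the SOL_SOCKET level is an assumption.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Mirrors the structure shown above; normally provided by the XLIO headers. */
struct xlio_pd_attr {
    uint32_t flags;
    void *ib_pd;
};

/* Query the PD of the ring used by fd. so_xlio_pd is the value of
 * SO_XLIO_PD from the XLIO headers. The SOL_SOCKET level is an assumption.
 * Returns 0 on success or a negative errno value on failure. */
int query_ring_pd(int fd, int so_xlio_pd, struct xlio_pd_attr *attr)
{
    socklen_t len = sizeof(*attr);

    memset(attr, 0, sizeof(*attr)); /* flags = 0: default (TX ring) */
    if (getsockopt(fd, SOL_SOCKET, so_xlio_pd, attr, &len) != 0)
        return -errno;
    return 0;
}
```

On success, attr->ib_pd would hold the protection domain with which the application registers its TX buffers.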
NVME over TCP (NVMEoTCP) hardware offload feature accelerates NVMEoTCP DIGEST calculation for transmitted NVME PDUs.
Prerequisites
Please refer to System Requirements
The card must support NVMEoTCP offload (ConnectX-7, BlueField-3)
Usage
Use the application to call setsockopt(fd, IPPROTO_TCP, TCP_ULP, "nvme", 4)
Call: setsockopt(fd, NVDA_NVME, NVME_TX, &configure, sizeof(configure))
where: uint32_t configure = XLIO_NVME_HDGST_ENABLE | XLIO_NVME_DDGST_ENABLE | XLIO_NVME_DDGST_OFFLOAD
Note: If any of the setsockopt calls fail, offload is not supported.
Call: setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &opt_val, sizeof(opt_val))
where: int opt_val = 1
Use the application to register TX buffers with the PD from XLIO - see the Getting Ring Protection Domain section above
Call sendmsg(fd, msg, MSG_ZEROCOPY) with the extended zero-copy API
where msg is of type msghdr, with:
cmsghdr *cmsg = CMSG_FIRSTHDR(msg);
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_XLIO_NVME_PD;
cmsg->cmsg_len = msg->msg_controllen;
and with CMSG_DATA(cmsg) set to xlio_pd_key
See Kernel TX Zero Copy documentation: https://www.kernel.org/doc/html/v4.15/networking/msg_zerocopy.html
Full examples can be found in XLIO GIT repository: https://github.com/Mellanox/libxlio
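As an illustration of the control-message step only, the sketch below fills the first control message of a msghdr with an array of memory keys using the standard CMSG macros. The function name attach_pd_keys is hypothetical, the keys are treated as opaque bytes because the layout of xlio_pd_key is not given here, and the control-message type is passed in as a parameter (SCM_XLIO_NVME_PD from the XLIO headers in a real build).

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Fill the first control message of msg with an opaque array of PD keys.
 * The caller provides msg->msg_control and sets msg->msg_controllen to
 * CMSG_LEN(keys_len), so cmsg_len equals msg_controllen as in the steps
 * above. Returns 0 on success, -1 if the control buffer is too small. */
int attach_pd_keys(struct msghdr *msg, int cmsg_type,
                   const void *keys, size_t keys_len)
{
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(msg);

    if (cmsg == NULL || msg->msg_controllen < CMSG_LEN(keys_len))
        return -1;

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = cmsg_type;
    cmsg->cmsg_len = CMSG_LEN(keys_len);
    memcpy(CMSG_DATA(cmsg), keys, keys_len);
    return 0;
}
```

The number of keys passed this way is expected to match msg_iovlen, one key per data buffer.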
Offloaded TCP sockets support the SO_XLIO_ISOLATE option at the SOL_SOCKET level. The option allows grouping sockets under a specific policy. The option value has type int and contains the policy.
Supported policies:
SO_XLIO_ISOLATE_DEFAULT – default behavior according to XLIO configuration.
SO_XLIO_ISOLATE_SAFE – isolate sockets from the default sockets and guarantee thread safety regardless of XLIO configuration. This policy is effective in XLIO_TCP_CTL_THREAD=delegate configuration. Socket API thread safety model is not changed.
Limitations:
The SO_XLIO_ISOLATE option may be set only after the socket() syscall and before either listen() or connect().
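A minimal sketch of setting the policy is shown below. To stay compilable without the XLIO headers, the option number and the policy value are passed in as parameters; in a real build they would be SO_XLIO_ISOLATE and one of the SO_XLIO_ISOLATE_* policies from the XLIO headers. The function name set_isolate_policy is hypothetical.

```c
#include <errno.h>
#include <sys/socket.h>

/* Apply an isolation policy to a freshly created TCP socket.
 * Must be called after socket() and before listen() or connect().
 * so_xlio_isolate and policy come from the XLIO headers.
 * Returns 0 on success or a negative errno value on failure. */
int set_isolate_policy(int fd, int so_xlio_isolate, int policy)
{
    if (setsockopt(fd, SOL_SOCKET, so_xlio_isolate,
                   &policy, sizeof(policy)) != 0)
        return -errno;
    return 0;
}
```

A failure here most likely means the socket is not offloaded or the call was made too late in the socket's lifetime, so the application should treat the socket as non-isolated.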