NVIDIA Accelerated IO (XLIO) Documentation Rev 3.20.8 LTS

Advanced Features

DPCP provides a unified, flexible interface for programming NVIDIA NICs. DPCP is a prerequisite for enabling XLIO features such as LRO offload, Striding-RQ and TLS HW offload.

XLIO, which comes as part of OFED, uses DPCP. Please check the Important Notes under Changes and New Features for the minimum DPCP version required.

DPCP is an open source project - see its repository here.

The TLS HW offload feature accelerates TLS encryption and decryption.

Prerequisites

  • Please refer to System Requirements

  • The card must be crypto enabled based on supported cards

  • Linux distribution with kTLS support

  • Application or TLS library with kTLS support

  • OpenSSL library for symmetric encryption SW fallback.

Usage

XLIO offloads Linux kTLS API. Refer to Linux documentation for description of the kTLS API: https://www.kernel.org/doc/html/latest/networking/tls.html.

TLS HW offload can be provided to application implicitly by a TLS library with kTLS support such as OpenSSL.

If TLS HW offload cannot be provided, the setsockopt() syscall returns an error with errno=ENOPROTOOPT. TLS HW offload can be forcibly disabled per direction using the configuration options XLIO_UTLS_TX and XLIO_UTLS_RX.
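The kTLS calls described above can be sketched in C as follows. This is a minimal illustration that assumes a TLS 1.2 AES128-GCM session whose keys were already negotiated by the application; the helper name enable_ktls_tx is hypothetical and not part of XLIO, and the fallback values for TCP_ULP and SOL_TLS are taken from the Linux UAPI headers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <linux/tls.h>

#ifndef TCP_ULP
#define TCP_ULP 31   /* value from the Linux UAPI headers */
#endif
#ifndef SOL_TLS
#define SOL_TLS 282  /* value from the Linux UAPI headers */
#endif

/* Hypothetical helper (not an XLIO API): enables kTLS on the TX direction.
 * Returns 0 on success or a negative errno value on failure. */
int enable_ktls_tx(int fd, const struct tls12_crypto_info_aes_gcm_128 *ci)
{
    /* Attach the "tls" upper-layer protocol to the TCP socket. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
        return -errno;

    /* Install the negotiated TX key material. According to the text above,
     * this is where ENOPROTOOPT is reported if offload cannot be provided. */
    if (setsockopt(fd, SOL_TLS, TLS_TX, ci, sizeof(*ci)) < 0)
        return -errno;

    return 0;
}
```

A caller would typically treat a return value of -ENOPROTOOPT as a signal to keep encrypting in user space (for example through the TLS library) rather than as a fatal error.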

The TLS HW offload feature adds new statistics counters. Their presence indicates that offload is configured and working. The xlio_stats tool with option -v3 shows TLS statistics for TCP sockets and rings:


======================================================
Fd=[59]
- TCP, Non-blocked
- Local Address = [14.212.1.34:443]
- Foreign Address = [14.212.1.57:49072]
Tx Offload: 18511 / 39409 / 0 / 0 [kilobytes/packets/eagains/errors]
Rx Offload: 1045354 / 2210387 / 0 / 1 [kilobytes/packets/eagains/errors]
Rx byte: cur 0 / max 313 / dropped 0 / limit 0
Rx pkt : cur 0 / max 1 / dropped 0
TLS Offload: version 0303 / cipher 51 / TX On / RX On
TLS Tx Offload: 17394 / 39407 [kilobytes/records]
TLS Rx Offload: 982755 / 2210381 / 28 / 0 [kilobytes/records/encrypted/mixed]
TLS Rx Resyncs: 1 [total]
======================================================
RING_ETH=[0]
Tx Offload: 18519 / 39559 [kilobytes/packets]
Rx Offload: 5080 / 39419 [kilobytes/packets]
TLS TX Context Setups: 1
TLS RX Context Setups: 1
Interrupts: 39324 / 38656 [requests/received]
Moderation: 1024 / 1024 [frames/usec period]
======================================================

Description of the statistics counters:

  • TLS Offload (version) - 0303 for TLS1.2 and 0304 for TLS1.3.

  • TLS Offload (cipher) - 51 for AES128-GCM and 52 for AES256-GCM.

  • TLS Offload (TX|RX) - On|Off values show whether TLS offload is enabled for transmit (TX) and receive (RX).

  • TLS Tx Offload (kilobytes) – number of offloaded kilobytes excluding headers and other TLS record overhead.

  • TLS Tx Offload (records) – number of created and queued TLS records.

  • TLS Tx Resyncs – number of HW resynchronizations due to out of sequence send operations.

  • TLS Rx Offload (kilobytes) - number of kilobytes received as TLS payload.

  • TLS Rx Offload (records) - total number of TLS records received on the socket.

  • TLS Rx Offload (encrypted) - number of encrypted TLS records that were decrypted in SW by XLIO.

  • TLS Rx Offload (mixed) - number of partially decrypted TLS records handled by XLIO.

  • TLS Rx Resyncs – number of times HW lost synchronization.

  • TLS TX Context Setups – cumulative counter of created TLS TX contexts, which equals the total number of sockets with configured TLS TX offload.

  • TLS RX Context Setups – cumulative counter of created TLS RX contexts, which equals the total number of sockets with configured TLS RX offload.
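The version and cipher codes listed above can be decoded with a small lookup, sketched here in C. The helper names are hypothetical and only illustrate the mapping. The sketch assumes the version field is printed in hexadecimal (0303 corresponds to 0x0303) and the cipher field in decimal, which matches the TLS_CIPHER_AES_GCM_128 (51) and TLS_CIPHER_AES_GCM_256 (52) values in the Linux kTLS headers.

```c
/* Hypothetical helpers for reading the "TLS Offload" line of xlio_stats. */

/* Maps the printed TLS version code to a readable name. */
const char *tls_version_name(unsigned version)
{
    switch (version) {
    case 0x0303: return "TLS1.2";
    case 0x0304: return "TLS1.3";
    default:     return "unknown";
    }
}

/* Maps the printed cipher code to the AES-GCM key size in bits,
 * or returns -1 for a code that is not listed in this section. */
int tls_cipher_key_bits(unsigned cipher)
{
    switch (cipher) {
    case 51: return 128;  /* AES128-GCM */
    case 52: return 256;  /* AES256-GCM */
    default: return -1;
    }
}
```

For the sample output above (version 0303, cipher 51), these helpers would report TLS1.2 with a 128-bit AES-GCM key.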

Tuning of XLIO for TLS HW Offload in DNS-over-HTTPS (DoH) scenario

For the DNS-over-HTTPS (DoH) scenario, there are specific profiles optimized for the NGINX frontend side. For an x86 server, we recommend using XLIO_SPEC=nginx. For an NVIDIA DPU system, we recommend using XLIO_SPEC=nginx_dpu.

Tuning of XLIO for TLS HW Offload in Content Delivery Network (CDN) scenario

The basic profile for Content Delivery Network (CDN) scenario is XLIO_SPEC=nginx.

In the CDN scenario TLS payload often exceeds MTU size. In this case, it is recommended to increase TX buffer size. With larger TX buffers XLIO can create more optimal TLS records.


XLIO_TX_BUF_SIZE=16384

However, this change may require increasing the number of hugepages configured in the system.

Supported Ciphers

The below table lists all the supported offloaded ciphers.

TLS Version  Bits  Hardware Offload   OpenSSL Name                   XLIO Support (TX)  XLIO Support (RX)
1.2          128   TLS1.2-AES128-GCM  AES128-GCM-SHA256              YES                YES
                                      ECDHE-ECDSA-AES128-GCM-SHA256  YES                YES
                                      ECDHE-RSA-AES128-GCM-SHA256    YES                YES
1.2          256   TLS1.2-AES256-GCM  AES256-GCM-SHA384              YES (1)            YES (1)
                                      ECDHE-ECDSA-AES256-GCM-SHA384  YES (1)            YES (1)
                                      ECDHE-RSA-AES256-GCM-SHA384    YES (1)            YES (1)
1.3          128   TLS1.3-AES128-GCM  TLS_AES_128_GCM_SHA256         YES (1)            YES (1)
1.3          256   TLS1.3-AES256-GCM  TLS_AES_256_GCM_SHA384         YES (1)            YES (1)

  1. Not supported by RHEL v8.3 yet.

XLIO supports hardware timestamping for UDP-RX flow (only) with Precision Time Protocol (PTP).

When using XLIO on a server running a PTP daemon, XLIO can periodically query the kernel to obtain updated time conversion parameters which it uses in conjunction with the hardware time-stamp it receives from the NIC to provide synchronized time.

Prerequisites

  • Supported devices: NIC clock

  • Set XLIO_HW_TS_CONVERSION environment variable to 4

Usage

  1. Set the SO_TIMESTAMPING option for the socket with value SOF_TIMESTAMPING_RX_HARDWARE:


    uint8_t val = SOF_TIMESTAMPING_RX_HARDWARE;
    setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));

  2. Set XLIO environment parameter XLIO_HW_TS_CONVERSION to 4.
    Example:
    Use the Linux kernel (v4.11) timestamping example found in the kernel source at: tools/testing/selftests/networking/timestamping/timestamping.c.


    Server
    $ sudo LD_PRELOAD=libxlio.so XLIO_HW_TS_CONVERSION=4 ./timestamping <iface> SOF_TIMESTAMPING_RAW_HARDWARE SOF_TIMESTAMPING_RX_HARDWARE

    Client
    $ LD_PRELOAD=libxlio.so sockperf tp -i <server-ip> -t 3600 -p 6666 --mps 10

    timestamping output:
    SOL_SOCKET SO_TIMESTAMPING SW 0.000000000 HW raw 1497823023.070846953
    IP_PKTINFO interface index 8
    SOL_SOCKET SO_TIMESTAMPING SW 0.000000000 HW raw 1497823023.170847260
    IP_PKTINFO interface index 8
    SOL_SOCKET SO_TIMESTAMPING SW 0.000000000 HW raw 1497823023.270847093
    IP_PKTINFO interface index 8
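On the receive side, the hardware timestamp arrives as ancillary data on recvmsg(). The following C sketch shows one way to extract it, assuming the standard Linux SCM_TIMESTAMPING layout of three struct timespec values in which the third entry carries the raw hardware timestamp (the "HW raw" value in the output above). The helper name hw_rx_timestamp is hypothetical.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

#ifndef SCM_TIMESTAMPING
#define SCM_TIMESTAMPING SO_TIMESTAMPING
#endif

/* Hypothetical helper: scans the control messages of a msghdr that was
 * filled in by recvmsg() and copies the raw hardware timestamp to *out.
 * Returns 0 if a timestamp was found, -1 otherwise. */
int hw_rx_timestamp(struct msghdr *msg, struct timespec *out)
{
    struct cmsghdr *c;

    for (c = CMSG_FIRSTHDR(msg); c != NULL; c = CMSG_NXTHDR(msg, c)) {
        struct timespec ts[3];

        if (c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_TIMESTAMPING)
            continue;
        if (c->cmsg_len < CMSG_LEN(sizeof(ts)))
            return -1;  /* truncated control message */

        /* ts[0] is the software timestamp, ts[1] is unused,
         * ts[2] is the raw hardware timestamp. */
        memcpy(ts, CMSG_DATA(c), sizeof(ts));
        *out = ts[2];
        return 0;
    }
    return -1;
}
```

The application would call this after each recvmsg() on a socket configured as in step 1, passing a msghdr whose msg_control buffer is large enough for the timestamp message.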

Each PCI transaction between the system’s RAM and the NIC takes at least ~300 nsec (and more, depending on buffer size). Application egress latency can be improved by eliminating as many PCI transactions as possible on the send path.

XLIO achieves this by copying the WQE into the doorbell. For small packets (<190 bytes of payload), XLIO can also inline the packet into the WQE, which eliminates the data-gather PCI transaction as well. For data sizes above 190 bytes, an additional PCI gather cycle by the NIC is required to pull the data buffer for egress.

XLIO uses the on-device memory to store the egress packet if it does not fit into the BF inline buffer. The on-device memory is a resource managed by XLIO and is transparent to the user. The total size of the on-device memory is limited to 256k for a single-port NIC and to 128k for a dual-port NIC. Using XLIO_RING_DEV_MEM_TX, the user can set the amount of on-device memory buffer allocated for each TX ring.

Prerequisites

  • Driver: MLNX_OFED version 23.07 and above

  • NIC: NVIDIA ConnectX®-6 Dx/ConnectX-7/BlueField-2/BlueField-3 and above

  • Protocol: Ethernet.

  • Set XLIO_RING_DEV_MEM_TX environment variable to best suit the application's requirements

Verifying On-Device Memory Capability in the Hardware

To verify “On Device Memory” capability in the hardware, run XLIO with DEBUG trace level:


XLIO_TRACELEVEL=DEBUG LD_PRELOAD=<path to libxlio.so> <command line>

Look in the printout for a positive value of on-device-memory bytes.

For example:


Pid: 1748924 Tid: 1748924 XLIO DEBUG : ibch[0x5633333c62f0]:229:print_val() mlx5_2: port(s): 1 vendor: 4125 fw: 22.31.1034 max_qp_wr: 32768 on_device_memory: 131072 packet_pacing_caps: min rate 1, max rate 100000000
Pid: 1748924 Tid: 1748924 XLIO DEBUG : ibch[0x56333340fa60]:229:print_val() mlx5_3: port(s): 1 vendor: 4125 fw: 22.31.1034 max_qp_wr: 32768 on_device_memory: 131072 packet_pacing_caps: min rate 1, max rate 100000000

To show and monitor On-Device Memory statistics, run xlio_stats tool.


xlio_stats -p <pid> -v 3

For example:


======================================================
RING_ETH=[0]
Tx Offload: 858931 / 3402875 [kilobytes/packets]
Rx Offload: 865251 / 3402874 [kilobytes/packets]
Dev Mem Alloc: 16384
Dev Mem Stats: 739074 / 1784935 / 0 [kilobytes/packets/oob]
======================================================


Warning

To enable the TCP_QUICKACK threshold, modify the TCP_QUICKACK_THRESHOLD parameter in the lwip/opt.h file and recompile XLIO.

While the TCP_QUICKACK option is enabled, TCP acknowledgments are sent immediately rather than being delayed as in normal TCP receive operation. However, sending the acknowledgment delays processing of the incoming packet until the acknowledgment has been sent, which can affect performance.

The TCP_QUICKACK threshold lets the user disable quick acknowledgment for payloads larger than the threshold. The threshold is effective only when TCP_QUICKACK is enabled, either with setsockopt() or with the XLIO_TCP_QUICKACK parameter. The TCP_QUICKACK threshold is disabled by default.
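Enabling TCP_QUICKACK per socket uses the standard socket option, as in this minimal C sketch. The helper name set_tcp_quickack is hypothetical; the threshold itself is a compile-time setting as described above and is not touched by this call.

```c
#define _GNU_SOURCE
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: turns TCP_QUICKACK on (enable = 1) or off
 * (enable = 0) for one TCP socket.
 * Returns 0 on success, or -1 with errno set on failure. */
int set_tcp_quickack(int fd, int enable)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &enable, sizeof(enable));
}
```

An application that prefers a process-wide setting can leave its code unchanged and use the XLIO_TCP_QUICKACK parameter instead.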

The XLIO daemon is responsible for managing the traffic control logic of all XLIO processes, including qdisc, u32 table hashing, adding and removing filters, and removing filters when an application crashes.

For XLIO daemon usage instructions, refer to the Installing the XLIO Binary Package section in the Installation Guide.

To show and monitor TAP statistics, run the xlio_stats tool:


xlio_stats -p <pid> -v 3

Example:


======================================================
RING_TAP=[0]
Master: 0x29e4260
Tx Offload: 4463 / 67209 [kilobytes/packets]
Rx Offload: 5977 / 90013 [kilobytes/packets]
Rx Buffers: 256
VF Plugouts: 1
Tap fd: 21
Tap Device: td34f15
======================================================
RING_ETH=[1]
Master: 0x29e4260
Tx Offload: 7527 / 113349 [kilobytes/packets]
Rx Offload: 7527 / 113349 [kilobytes/packets]
Retransmissions: 1
======================================================

Output analysis:

  • RING_TAP[0] and RING_ETH[1] have the same master 0x29e4260 ring

  • 4463 Kbytes/67209 packets were sent from the TAP device

  • 5977 Kbytes/90013 packets were received from the TAP device

  • Plugout event occurred once

  • TAP device fd number was 21, TAP name was td34f15

Getting Ring Protection Domain

Pass a special structure as an argument to getsockopt() with SO_XLIO_PD to get protection domain (PD) information from the ring used by the current socket. For a TX ring, this information is available after the connection is established; for an RX ring, it is available after the socket is bound to a device. By default, the PD of the TX ring is returned. This can be used with sendmsg(SCM_XLIO_PD), where the data portion contains an array of elements of type struct xlio_pd_key. The number of elements in this array should equal the msg_iovlen value, so that every data pointer in msg_iov has a corresponding memory key.


struct xlio_pd_attr {
    uint32_t flags;
    void* ib_pd;
};

The NVME over TCP (NVMEoTCP) hardware offload feature accelerates NVMEoTCP DIGEST calculation for transmitted NVME PDUs.

Prerequisites

  • Please refer to System Requirements

  • The card must support NVMEoTCP offload (ConnectX-7, BlueField-3)

Usage

  1. Use the application to call setsockopt(fd, IPPROTO_TCP, TCP_ULP, "nvme", 4)

  2. Call: setsockopt(fd, NVDA_NVME, NVME_TX, &configure, sizeof(configure))
    where: uint32_t configure = XLIO_NVME_HDGST_ENABLE | XLIO_NVME_DDGST_ENABLE | XLIO_NVME_DDGST_OFFLOAD

    Note: If any of the setsockopt calls fail, offload is not supported.

  3. Call: setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &opt_val, sizeof(opt_val))
    where: int opt_val = 1

  4. Use the application to register TX buffers with the PD from XLIO - see the Getting Ring Protection Domain section above

  5. Call sendmsg(fd, msg, MSG_ZEROCOPY) with extended zero-copy API
    where msg is of type msghdr

    1. With cmsghdr *cmsg = CMSG_FIRSTHDR(msg);

    2. cmsg->cmsg_level = SOL_SOCKET;

    3. cmsg->cmsg_type = SCM_XLIO_NVME_PD;

    4. cmsg->cmsg_len = msg->msg_controllen;

    5. With CMSG_DATA(cmsg) set to xlio_pd_key
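The control-message setup in step 5 can be sketched in C as follows. This is an illustration only: the helper name attach_key_cmsg is hypothetical, the caller supplies the message type (SCM_XLIO_NVME_PD from the XLIO headers), and the key array is passed as opaque bytes because the layout of struct xlio_pd_key is not shown in this section.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: attaches one SOL_SOCKET control message carrying
 * the memory-key array to msg. ctrl is caller-provided storage of ctrl_cap
 * bytes, suitably aligned for struct cmsghdr.
 * Returns 0 on success, -1 if the storage is too small. */
int attach_key_cmsg(struct msghdr *msg, void *ctrl, size_t ctrl_cap,
                    int cmsg_type, const void *keys, size_t keys_len)
{
    size_t need = CMSG_LEN(keys_len);
    struct cmsghdr *cmsg;

    if (ctrl_cap < need)
        return -1;

    memset(ctrl, 0, ctrl_cap);
    msg->msg_control = ctrl;
    msg->msg_controllen = need;

    cmsg = CMSG_FIRSTHDR(msg);
    if (cmsg == NULL)
        return -1;

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = cmsg_type;            /* e.g. SCM_XLIO_NVME_PD */
    cmsg->cmsg_len = msg->msg_controllen;   /* as in step 5.4 above */
    memcpy(CMSG_DATA(cmsg), keys, keys_len);
    return 0;
}
```

After this call the application would fill msg_iov and msg_iovlen with the registered TX buffers, one key per iovec entry, and pass the message to sendmsg(fd, msg, MSG_ZEROCOPY).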

See Kernel TX Zero Copy documentation: https://www.kernel.org/doc/html/v4.15/networking/msg_zerocopy.html

Full examples can be found in XLIO GIT repository: https://github.com/Mellanox/libxlio

© Copyright 2023, NVIDIA. Last updated on Dec 12, 2023.