MLNX_OFED Documentation Rev 4.9-2.2.4.0 LTS

Programming

Warning

This chapter is aimed at application developers and expert users who wish to develop applications over MLNX_OFED.

Warning

Supported on ConnectX®-4 adapter cards and above only.

Raw Ethernet programming enables writing an application that bypasses the kernel stack. To achieve this, packet headers and offload options need to be provided by the application.
For a basic example on how to use Raw Ethernet programming, refer to the Raw Ethernet Programming: Basic Introduction - Code Example Community post.
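Because the kernel stack is bypassed, the application itself writes every header into the send buffer. The following is a minimal sketch of building the 14-byte Ethernet II header only; the helper name build_eth_header and the two macros are hypothetical and are not part of the verbs API.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define ETH_ADDR_LEN 6   /* bytes in a MAC address */
#define ETH_HDR_LEN  14  /* dst MAC + src MAC + EtherType */

/* Write an Ethernet II header at the start of buf.
 * Returns the number of bytes written, or 0 if buf is NULL or too small. */
size_t build_eth_header(uint8_t *buf, size_t buf_len,
                        const uint8_t dst[ETH_ADDR_LEN],
                        const uint8_t src[ETH_ADDR_LEN],
                        uint16_t ethertype)
{
    if (buf == NULL || buf_len < ETH_HDR_LEN)
        return 0;

    memcpy(buf, dst, ETH_ADDR_LEN);
    memcpy(buf + ETH_ADDR_LEN, src, ETH_ADDR_LEN);

    /* EtherType is stored in network byte order (big endian). */
    buf[12] = (uint8_t)(ethertype >> 8);
    buf[13] = (uint8_t)(ethertype & 0xff);
    return ETH_HDR_LEN;
}
```

The L3/L4 headers and the payload would follow at offset 14 in the same buffer before it is posted for sending.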

Packet Pacing

Packet pacing is a raw Ethernet sender feature that enables controlling the rate of each QP, per send queue.
For a basic example of how to use packet pacing per flow over libibverbs, refer to the Raw Ethernet Programming: Packet Pacing - Code Example Community post.
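To get a feel for what a rate limit means on the wire, the arithmetic below converts a target rate into the spacing between consecutive packets. This is only an illustrative sketch: the function name is hypothetical, and expressing the rate in kilobits per second is an assumption made for the example, not a statement about the units the device expects.

```c
#include <stdint.h>

/* Nanoseconds between the start of consecutive packets so that packets of
 * pkt_bytes bytes leave at rate_kbps kilobits per second (1 kbit = 1000 bits).
 * Returns 0 when rate_kbps is 0, which this sketch treats as "no pacing". */
uint64_t pacing_interval_ns(uint32_t pkt_bytes, uint64_t rate_kbps)
{
    if (rate_kbps == 0)
        return 0;

    /* bits / (kbit/s) gives milliseconds; scale by 1e6 to get nanoseconds. */
    return ((uint64_t)pkt_bytes * 8u * 1000000u) / rate_kbps;
}
```

For example, 1500-byte packets paced at 1 Gb/s (1000000 kb/s) are spaced 12000 ns apart.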

TCP Segmentation Offload (TSO)

TCP Segmentation Offload (TSO) enables the adapter card to accept a block of data larger than the MTU size. The TSO engine splits the data into separate packets and automatically inserts the user-specified L2/L3/L4 headers into each packet. With TSO, the CPU is relieved of segmenting large volumes of data.
To program TSO on the sender side, refer to the Raw Ethernet Programming: TSO - Code Example Community post.
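The splitting that the TSO engine performs can be reasoned about with simple arithmetic. The sketch below computes how many packets a payload becomes and how long each one is, given a maximum segment size (MSS). Both function names are hypothetical and model only the payload split, not the header replication done by the hardware.

```c
#include <stddef.h>

/* Number of packets needed to carry payload_len bytes when each packet
 * holds at most mss payload bytes. Returns 0 if mss is 0. */
size_t tso_segment_count(size_t payload_len, size_t mss)
{
    if (mss == 0)
        return 0;
    return (payload_len + mss - 1) / mss;  /* ceiling division */
}

/* Payload length of the segment at the given zero-based index.
 * Every segment is mss bytes long except possibly the last one.
 * Returns 0 if index is out of range. */
size_t tso_segment_len(size_t payload_len, size_t mss, size_t index)
{
    size_t count = tso_segment_count(payload_len, mss);

    if (index >= count)
        return 0;
    if (index + 1 < count)
        return mss;
    return payload_len - (count - 1) * mss;
}
```

With a 64000-byte payload and an MSS of 1460 bytes, this yields 44 packets, the last of which carries 1220 bytes.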

ToS Based Steering

ToS is an 8-bit field in the IP header, of which the upper 6 bits carry the DSCP code; it enables different service levels to be assigned to network traffic. This is achieved by marking each packet in the network with a DSCP code and assigning the corresponding level of service to it.
To steer packets according to the ToS field on the receiver side, refer to the Raw Ethernet Programming: ToS - Code Example Community post.
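When building or matching on the ToS byte, it helps to separate its two parts: the upper 6 bits hold the DSCP code and the lower 2 bits hold ECN. A minimal sketch follows; the helper names are hypothetical.

```c
#include <stdint.h>

/* Extract the 6-bit DSCP code from the 8-bit ToS byte. */
static inline uint8_t tos_to_dscp(uint8_t tos)
{
    return (uint8_t)(tos >> 2);
}

/* Extract the 2-bit ECN field from the 8-bit ToS byte. */
static inline uint8_t tos_to_ecn(uint8_t tos)
{
    return (uint8_t)(tos & 0x3);
}

/* Compose a ToS byte from a DSCP code (0..63) and an ECN value (0..3).
 * Out-of-range bits are masked off. */
static inline uint8_t make_tos(uint8_t dscp, uint8_t ecn)
{
    return (uint8_t)(((dscp & 0x3f) << 2) | (ecn & 0x3));
}
```

For example, DSCP 46 (Expedited Forwarding) with ECN 0 corresponds to a ToS byte of 0xB8.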

Flow ID Based Steering

Flow ID based steering enables writing code that steers packets by flow ID when developing Raw Ethernet applications over verbs. For more information on flow ID based steering, refer to the Raw Ethernet Programming: Flow ID Steering - Code Example Community post.

VXLAN Based Steering

VXLAN based steering enables writing code that steers packets by the VXLAN tunnel ID when developing Raw Ethernet applications over verbs. For more information on VXLAN based steering, refer to the Raw Ethernet Programming: VXLAN Steering - Code Example Community post.
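The tunnel ID used for steering is the 24-bit VXLAN Network Identifier (VNI) carried in the 8-byte VXLAN header. The sketch below shows where that value sits in the header, following the layout in RFC 7348; the function and macro names are hypothetical, and the steering rule itself is configured through verbs and is not shown here.

```c
#include <stdint.h>
#include <stddef.h>

#define VXLAN_HDR_LEN  8
#define VXLAN_FLAG_VNI 0x08  /* "I" flag: the VNI field is valid */

/* Parse the VNI from a VXLAN header.
 * Layout: byte 0 flags, bytes 1-3 reserved, bytes 4-6 VNI, byte 7 reserved.
 * Returns the VNI (0..0xFFFFFF), or -1 if the header is NULL, too short,
 * or does not have the "I" flag set. */
int32_t vxlan_parse_vni(const uint8_t *hdr, size_t len)
{
    if (hdr == NULL || len < VXLAN_HDR_LEN)
        return -1;
    if ((hdr[0] & VXLAN_FLAG_VNI) == 0)
        return -1;

    return ((int32_t)hdr[4] << 16) | ((int32_t)hdr[5] << 8) | (int32_t)hdr[6];
}
```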

Warning

This feature is at beta level and is supported on ConnectX-5/ConnectX-5 Ex adapter cards and above only.

Device Memory is a new experimental verbs API that allows using on-chip memory, located on the device, as a data buffer for send/receive and RDMA operations. The device memory can be mapped and accessed directly by user and kernel applications. It can be allocated in various sizes and registered as memory regions with local and remote access keys for performing the send/receive and RDMA operations.
Using the device memory to store packets for transmission can significantly reduce transmission latency compared to using host memory.

Device Memory Programming Model

The new API follows a procedure similar to the one used with host memory for sending packets from a buffer:

  • ibv_exp_alloc_dm()/ibv_exp_free_dm() - to allocate/free device memory

  • ibv_exp_reg_mr() - to register the allocated device memory buffer as a memory region and get a memory key for local/remote access by the device

  • ibv_exp_memcpy_dm() - to copy data to/from a device memory buffer

  • ibv_post_send()/ibv_post_recv() - to request the device to perform a send/receive operation using the memory key

Warning

When using ibv_exp_memcpy_dm() to copy data into an allocated device memory buffer, the destination address within the device memory buffer must be a multiple of 4 bytes.
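One way to respect this constraint is to check or round up the destination offset before issuing the copy. The helpers below are a hypothetical sketch and are not part of the verbs API; they operate on an offset value only and do not touch the device.

```c
#include <stdint.h>
#include <stdbool.h>

#define DM_COPY_ALIGN 4u  /* required alignment of the destination offset, in bytes */

/* True if dm_offset is a multiple of 4 bytes. */
static inline bool dm_offset_is_aligned(uint64_t dm_offset)
{
    return (dm_offset % DM_COPY_ALIGN) == 0;
}

/* Round dm_offset up to the next multiple of 4 bytes
 * (already aligned offsets are returned unchanged). */
static inline uint64_t dm_offset_align_up(uint64_t dm_offset)
{
    return (dm_offset + DM_COPY_ALIGN - 1) & ~(uint64_t)(DM_COPY_ALIGN - 1);
}
```

An application packing several packets into one device memory buffer could place each packet at dm_offset_align_up(previous_end) so that every copy starts at a valid address.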

Example:


struct ibv_exp_dm *dm;
struct ibv_mr *mr;
struct ibv_exp_alloc_dm_attr dm_attr = {0};
struct ibv_exp_memcpy_dm_attr cpy_attr = {0};
struct ibv_exp_reg_mr_in mr_in = {
    .pd = my_pd,
    .addr = 0,
    .length = packet_size,
    .exp_access = IBV_EXP_ACCESS_LOCAL_WRITE,
    .create_flags = 0
};

/* Device memory allocation request */
dm_attr.length = packet_size;
dm = ibv_exp_alloc_dm(context, &dm_attr);

/* Device memory registration as memory region */
mr_in.dm = dm;
mr_in.comp_mask = IBV_EXP_REG_MR_DM;
mr = ibv_exp_reg_mr(&mr_in);

cpy_attr.memcpy_dir = IBV_EXP_DM_CPY_TO_DEVICE;
cpy_attr.host_addr = (void *)my_packet_buffer;
cpy_attr.length = packet_size;
cpy_attr.dm_offset = 0;
ibv_exp_memcpy_dm(dm, &cpy_attr);

struct ibv_sge list = {
    .addr = 0,
    .length = packet_size,
    .lkey = mr->lkey /* lkey of the memory region */
};
struct ibv_send_wr wr = {
    .wr_id = my_wrid,
    .sg_list = &list,
    .num_sge = 1,
    .opcode = IBV_WR_SEND,
    .send_flags = IBV_SEND_SIGNALED,
};
struct ibv_send_wr *bad_wr;

ibv_post_send(my_qp, &wr, &bad_wr);

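Notes on the three offsets set to 0 in the example:

  • mr_in.addr - offset in the dm buffer at which registration starts (it does not have to start from offset 0).

  • cpy_attr.dm_offset - offset from the beginning of the dm buffer to copy to/from.

  • list.addr - offset from the beginning of the dm buffer where the packet is located.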

The RDMA-CM QP Timeout Control feature enables users to control the QP timeout for QPs created with RDMA-CM.

A new option has been added to the rdma_set_option() function to override the calculated QP timeout that is used when providing QP attributes for QP modification. To use it, call rdma_set_option() with the new flag RDMA_OPTION_ID_ACK_TIMEOUT. Example:


rdma_set_option(cma_id, RDMA_OPTION_ID, RDMA_OPTION_ID_ACK_TIMEOUT, &timeout, sizeof(timeout));
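The timeout value is an exponent rather than a duration. Assuming this option uses the same encoding as the QP timeout attribute in verbs (4.096 usec * 2^timeout, a 5-bit value where 0 means wait indefinitely), the sketch below converts between the encoded value and nanoseconds. The helper names are hypothetical and are not part of librdmacm.

```c
#include <stdint.h>

#define ACK_TIMEOUT_MAX_EXP 31u  /* the timeout field is 5 bits wide */

/* Duration in nanoseconds for an encoded timeout: 4096 ns * 2^timeout.
 * Returns 0 for timeout == 0 (no timeout, wait indefinitely) and for
 * values above 31, which are out of range. */
uint64_t ack_timeout_ns(uint8_t timeout)
{
    if (timeout == 0 || timeout > ACK_TIMEOUT_MAX_EXP)
        return 0;
    return (uint64_t)4096 << timeout;
}

/* Smallest encoded timeout (1..31) whose duration is at least target_ns.
 * Targets longer than the maximum are clamped to 31. */
uint8_t ack_timeout_for_ns(uint64_t target_ns)
{
    uint8_t t;

    for (t = 1; t < ACK_TIMEOUT_MAX_EXP; t++) {
        if (((uint64_t)4096 << t) >= target_ns)
            break;
    }
    return t;
}
```

For example, a target of 1 ms maps to an encoded value of 8 (about 1.05 ms), and an encoded value of 14 corresponds to roughly 67 ms.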

Applications that do not create a QP through rdma_create_qp() may want to postpone the ESTABLISHED event on the passive side, to let the active side complete an application-specific connection establishment phase: for example, moving the QP created by the application from the INIT state to the RTR state, or preparing to receive messages from the passive side. With this feature, the active side receives a new event, CONNECT_RESPONSE, instead of ESTABLISHED when id->qp==NULL. This gives the application a chance to perform the extra connection setup. Afterwards, the application should call the new rdma_establish() API to complete the connection and generate an ESTABLISHED event on the passive side.

In addition, this feature exposes the rdma_init_qp_attr() function in the librdmacm API, which enables applications to get the parameters for creating an Address Handle (AH) or to control the attributes of a QP after the QP has been created.

© Copyright 2023, NVIDIA. Last updated on Oct 23, 2023.