NVIDIA DPDK Documentation MLNX_DPDK_22.11_2310.5.1 LTS

Connection Tracking Window Validation

Connection Tracking (CT) is the process of tracking a TCP connection between two endpoints (e.g., between a server and a client). This involves maintaining the state of the connection, identifying state-changing packets, identifying malformed packets, identifying packets that do not conform to the protocol or the current connection state, and more.

CT is an important process used in many networking solutions that require state awareness, such as stateful firewalls and the Linux kernel.

Currently, CT is executed at the software level, but with the world becoming more and more data driven, networking applications are required to run at higher bandwidth, and software is becoming the bottleneck. This is where DPDK CT comes in: it offers the ability to offload the CT process to hardware, keeping up with the speed of modern NICs.

To make the CT offload possible, we added hardware support for the operation (CT HW module), and a software rte_flow API that connects the user to the hardware.

The CT offload tracks a single connection per context, meaning every connection must be offloaded separately.

The HW module can track a given connection by maintaining a structure in HW referred to as CT context, which holds information about a single TCP connection. As such, every connection being tracked requires its own CT context object. However, the HW does not have the ability to initialize the context.

When the HW module receives a packet that is part of a tracked connection, it checks the packet based on that connection’s context, stores the result for later usage (CT result matching), and then updates the context based on the received packet.

This section explains how the context is initialized for a given connection, and how that connection’s packets are associated with a context object.

The API brings these main changes:

  • rte_flow CT Context

    The struct rte_flow_action_conntrack is almost identical to the HW CT context mentioned above. The SW context holds two extra fields, .peer_port and .is_original_dir, explained in detail below.

  • CT rte_flow Item (RTE_FLOW_ITEM_TYPE_CONNTRACK)

    This item allows the user to insert an rte_flow rule that matches the result of the CT HW module (CT action). To insert this rule, make sure that the packet passes through the CT HW module before reaching this rule. You can match on 5 possible results (valid/state_change/error/disabled/bad_packet) or combinations.

  • CT rte_flow Action (RTE_FLOW_ACTION_TYPE_CONNTRACK)

    The action .conf field points to a SW CT context and activates the CT HW module on this rule. When this action is created, the HW creates its own CT context and copies the values of the SW context.

  • CT Context Create / Modify

    • Create: The rte_flow_action_handle_create function creates a CT action, effectively initializing the HW CT context. The function also returns a handle to the created action. The handle is used later to insert the action as part of multiple rte_flow rules; in other words, the handle connects a HW context with the packets that match those rules.

    • Modify: The rte_flow_action_handle_update function allows you to modify the HW context, as well as change .is_original_dir, which is updated separately as explained below.
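As an illustration of the CONNTRACK item, the sketch below inserts a rule that matches packets the CT HW module marked as valid and forwards them to a queue. It assumes an initialized port whose earlier groups already apply the CONNTRACK action; error handling is left to the caller:

```c
#include <rte_flow.h>

/* Match packets that the CT HW module classified as valid, then queue them.
 * The packet must already have passed a rule carrying the CONNTRACK action
 * in an earlier group for the CT result to be available. */
static struct rte_flow *
ct_valid_rule(uint16_t port_id, uint16_t queue_idx, struct rte_flow_error *err)
{
    const struct rte_flow_attr attr = { .group = 2, .ingress = 1 };
    const struct rte_flow_item_conntrack ct_spec = {
        .flags = RTE_FLOW_CONNTRACK_PKT_STATE_VALID,
    };
    const struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_CONNTRACK,
          .spec = &ct_spec, .mask = &ct_spec },
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };
    const struct rte_flow_action_queue queue = { .index = queue_idx };
    const struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };

    return rte_flow_create(port_id, &attr, pattern, actions, err);
}
```

Other result bits (RTE_FLOW_CONNTRACK_PKT_STATE_CHANGED, _INVALID, _DISABLED, _BAD) can be OR-ed into .flags to match combinations.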

New Data Structures

struct rte_flow_action_conntrack (SW CT Context)

| Type | Name | Meaning |
|------|------|---------|
| uint16_t | peer_port | Port ID of the port where the reply-direction traffic will be received. (SW-only field) |
| uint32_t:1 | is_original_dir | Should be 1 if the items of the rte_flow match the original direction. This only affects rule insertion. (SW-only field) |
| uint32_t:1 | enable | Should be 1 to enable CT hardware offload. |
| uint32_t:1 | live_connection | Should be 1 if a 3-way handshake was made. |
| uint32_t:1 | selective_ack | Should be 1 if the connection supports TCP selective ACK. |
| uint32_t:1 | challenge_ack_passed | Should be 1 if the challenge ACK has passed. |
| uint32_t:1 | last_direction | Should be 0 if the last packet seen came from the original direction. |
| uint32_t:1 | liberal_mode | Should be 1 if you want the CT to track the state changes only (without any other TCP checks). |
| enum rte_flow_conntrack_state | state | The current state of the connection. |
| uint8_t | max_ack_window | Maximum window scale seen (MAX(original.wscale, reply.wscale)). |
| uint8_t | retransmission_limit | The maximum number of packets that can be retransmitted. |
| struct rte_flow_tcp_dir_param | original_dir | Holds information about the packets received from the original direction (see below). |
| struct rte_flow_tcp_dir_param | reply_dir | Holds information about the packets received from the reply direction (see below). |
| uint16_t | last_window | Window of the last packet seen. |
| enum rte_flow_conntrack_tcp_last_index | last_index | The type of the last packet seen (see below). |
| uint32_t | last_seq | The sequence number of the last packet seen. |
| uint32_t | last_ack | The acknowledgment number of the last packet seen. |
| uint32_t | last_end | The ACK that should be sent to acknowledge the last packet seen (seq + data_len). |

enum rte_flow_conntrack_state (CT state)

| Value | Meaning |
|-------|---------|
| RTE_FLOW_CONNTRACK_STATE_SYN_RECV | SYN-ACK packet seen |
| RTE_FLOW_CONNTRACK_STATE_ESTABLISHED | ACK packet seen (3-way handshake ACK) |
| RTE_FLOW_CONNTRACK_STATE_FIN_WAIT | FIN packet seen |
| RTE_FLOW_CONNTRACK_STATE_CLOSE_WAIT | ACK packet seen after FIN |
| RTE_FLOW_CONNTRACK_STATE_LAST_ACK | FIN packet seen after FIN |
| RTE_FLOW_CONNTRACK_STATE_TIME_WAIT | Last ACK seen after two FINs |

enum rte_flow_conntrack_tcp_last_index (packet type/TCP flags)

| Value | Meaning |
|-------|---------|
| RTE_FLOW_CONNTRACK_INDEX_NONE | No TCP flags set. |
| RTE_FLOW_CONNTRACK_INDEX_SYN | SYN flag set. |
| RTE_FLOW_CONNTRACK_INDEX_SYN_ACK | SYN and ACK flags set. |
| RTE_FLOW_CONNTRACK_INDEX_FIN | FIN flag set. |
| RTE_FLOW_CONNTRACK_INDEX_ACK | ACK flag set. |
| RTE_FLOW_CONNTRACK_INDEX_RST | RST flag set. |

struct rte_flow_tcp_dir_param

| Type | Name | Meaning for context.original_dir | Meaning for context.reply_dir |
|------|------|----------------------------------|-------------------------------|
| uint32_t:4 | scale | Original-direction window scaling factor. Set to 0xF to disable window scaling. | Reply-direction window scaling factor. Set to 0xF to disable window scaling. |
| uint32_t:1 | close_initiated | Should be 1 if the original direction sent a FIN packet. | Should be 1 if the reply direction sent a FIN packet. |
| uint32_t:1 | last_ack_seen | Should be 1 if the reply direction sent an ACK packet. | Should be 1 if the original direction sent an ACK packet. |
| uint32_t:1 | data_unacked | Original direction has data packets that have not been ACKed yet. | Reply direction has data packets that have not been ACKed yet. |
| uint32_t | sent_end | The expected ACK number to be sent from the reply direction (orig.seq + orig.data_len). | The expected ACK number to be sent from the original direction (reply.seq + reply.data_len). |
| uint32_t | reply_end | The maximum seq number that the original direction can send (reply.ack + reply.actual_window). | The maximum seq number that the reply direction can send (orig.ack + orig.actual_window). |
| uint32_t | max_win | The last actual window received from the reply direction (reply.actual_window). | The last actual window received from the original direction (orig.actual_window). |
| uint32_t | max_ack | The last ACK number sent from the original direction (orig.ack). | The last ACK number sent from the reply direction (reply.ack). |

Limitations

  • The CT action must be wrapped by an action_handle

  • Retransmission limit checking cannot work

  • The CT action cannot be inserted in group 0

  • Cannot track a connection before seeing the first 2 handshake packets (SYN and SYN-ACK)

  • No direction validation is done for closing sequence (FIN)

  • Maximum number of supported ports with CT action is 16

For the best flow insertion usage experience, the following actions/steps are recommended:

  • Use “hint_num_of_rules_log” to configure the maximum expected number of rules in each group; see Hint to the Driver on the Number of Flow Rules. The value is log2 of the expected number of rules. For example, assuming cross-port operation where each port will have 4M rules, the value should be 22. In the single-port case (where 2 rules are added to the same table), the expected number of rules is 8M, so the value should be 23.

  • Reuse CT objects. Upon new connection arrival, the usage flow should be as follows:

    • Check if there is a free CT object in the app pool. The CT object’s port configuration must match the one set when the CT object was created. For example, if the app created a CT object with original port 0 and reply port 1, it can be reused only in the same configuration.

    • If the CT object is reused, the application should modify the action with the newly requested state.

    • If the app does not have a free CT object, then it should allocate a new one by calling the create function.

The app creates the rules just like before with the CT object. To close a connection, the app removes the rules and adds the CT object to the application pool.

Note

Always keep at least one rule per matching criteria to save re-allocation time. This can be a dummy rule that will never be hit, which can be inserted as the first rule.

General Comments

  1. The export part will be changed in the June release and will be part of the rule creation.

  2. A new NV config mode, “icm_cache_mode_large_scale_steering”, is added to reduce cache misses and improve performance when working with many steering rules.

    This capability is enabled by setting the mlxconfig parameter "ICM_CACHE_MODE" to "1": mlxconfig -d <device> set ICM_CACHE_MODE=1. Note that this optimization is intended for use with a very large number of steering rules.

  3. To save insertion rate time, the CT objects can be pre-allocated at startup.

  4. Querying the CT state has a significant performance cost. All measurements are done without the query.

  5. Before closing the device, all CT objects should be destroyed.

CT is offloaded using rte_flow rules. Below is an explanation of how to create rules in three different groups:

  • Group 0: Has a single rule that matches on a TCP packet and jumps to group 1.

  • Group 1: Has two rules: the first matches on the connection’s 5-tuple, tags the packet as original, applies the CT action, and then jumps to group 2. The second rule is similar but matches on the reverse 5-tuple and tags the packet as reply. Both rules share the same CT action (and HW CT context). These rules must be created for every 5-tuple (connection).

  • Group 2: Has six rules: four inform the application about an event, while the remaining two match on a valid packet, one per direction, each forwarding the packet to a different hairpin queue.

The flows are as follows:

[Figure: rte_flow rules in groups 0-2 for CT offload]

As explained above, the rules in group 1 are inserted for every new connection. We refer to inserting these rules as offloading a connection.

To offload a connection, you must use the API as follows:

  1. Create a SW CT context that matches the connection parameters. It is important to set .is_original_dir = 1.

  2. Create a CT action that points to the SW context, and then commit it using rte_flow_action_handle_create, which returns an action_handle.

  3. Insert the rule(*) that matches on the 5-tuple of the connection. However, instead of specifying the CT action directly, use the action_handle created in step 2.

  4. Modify the CT action created in step 2 so that .is_original_dir = 0, using rte_flow_action_handle_update with a struct rte_flow_modify_conntrack parameter.

  5. Insert the rule(**) that matches on the reverse 5-tuple of the connection, in the same way as step 3.
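Steps 1-5 might look as follows in code. Here, insert_5tuple_rule is a hypothetical application helper (it would build and insert the group-1 rule for the given direction), and error handling is compressed:

```c
#include <rte_flow.h>

/* Hypothetical application helper: builds and inserts the group-1 rule
 * matching the (possibly reversed) 5-tuple and attaching the handle. */
int insert_5tuple_rule(uint16_t port_id, struct rte_flow_action_handle *handle,
                       int original, struct rte_flow_error *err);

static int
offload_connection(uint16_t port_id, struct rte_flow_action_conntrack *ct,
                   struct rte_flow_error *err)
{
    /* 1. SW context: start with the original direction. */
    ct->is_original_dir = 1;

    /* 2. Commit the CT action; the handle stands for the HW CT context. */
    const struct rte_flow_indir_action_conf iconf = { .ingress = 1 };
    const struct rte_flow_action action = {
        .type = RTE_FLOW_ACTION_TYPE_CONNTRACK, .conf = ct,
    };
    struct rte_flow_action_handle *handle =
        rte_flow_action_handle_create(port_id, &iconf, &action, err);
    if (handle == NULL)
        return -1;

    /* 3. Rule matching the original 5-tuple, via the action_handle. */
    if (insert_5tuple_rule(port_id, handle, /*original=*/1, err) != 0)
        return -1;

    /* 4. Flip the direction bit before inserting the reply-side rule. */
    struct rte_flow_modify_conntrack mod = { .direction = 1 };
    mod.new_ct.is_original_dir = 0;
    if (rte_flow_action_handle_update(port_id, handle, &mod, err) != 0)
        return -1;

    /* 5. Rule matching the reverse 5-tuple. */
    return insert_5tuple_rule(port_id, handle, /*original=*/0, err);
}
```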

Below is an example of how the CT context should be initialized (which values to set). In this example, we are interested in tracking a TCP connection only after the three-way handshake has completed, so we configure the CT context based on information gathered from these three packets.

The three-way handshake packets are as follows:

handshake_pkts[0] == syn_pkt == S

Ethernet Header: dst = 00:15:5d:61:d4:61, src = 00:15:5d:61:d9:61, type = IPv4

IP Header: version = 4, ihl = 5, tos = 0x10, len = 60, id = 11800, flags = DF, frag = 0, ttl = 64, proto = tcp, chksum = 0x28ba, src = 172.17.28.218, dst = 10.210.16.29

TCP Header: sport = 40314, dport = 22, seq = 2632987378, ack = 0, dataofs = 10, reserved = 0, flags = S, window = 65280, chksum = 0x7afb, urgptr = 0

TCP Options: [('MSS', 1360), ('SAckOK', ''), ('Timestamp', (3472272931, 0)), ('NOP', None), ('WScale', 7)]

handshake_pkts[1] == syn_ack_pkt == SA

Ethernet Header: dst = 00:15:5d:61:d9:61, src = 00:15:5d:61:d4:61, type = IPv4

IP Header: version = 4, ihl = 5, tos = 0x60, len = 60, id = 0, flags = DF, frag = 0, ttl = 55, proto = tcp, chksum = 0x5f82, src = 10.210.16.29, dst = 172.17.28.218

TCP Header: sport = 22, dport = 40314, seq = 2532480966, ack = 2632987379, dataofs = 10, reserved = 0, flags = SA, window = 28960, chksum = 0x49f5, urgptr = 0

TCP Options: [('MSS', 1460), ('SAckOK', ''), ('Timestamp', (1297236582, 3472272931)), ('NOP', None), ('WScale', 7)]

handshake_pkts[2] == ack_pkt == A

Ethernet Header: dst = 00:15:5d:61:d4:61, src = 00:15:5d:61:d9:61, type = IPv4

IP Header: version = 4, ihl = 5, tos = 0x10, len = 52, id = 11801, flags = DF, frag = 0, ttl = 64, proto = tcp, chksum = 0x28c1, src = 172.17.28.218, dst = 10.210.16.29

TCP Header: sport = 40314, dport = 22, seq = 2632987379, ack = 2532480967, dataofs = 8, reserved = 0, flags = A, window = 510, chksum = 0xe7db, urgptr = 0

TCP Options: [('NOP', None), ('NOP', None), ('Timestamp', (3472272939, 1297236582))]

SW CT Context Initial Values

| Field Name | Real Value | Semantic Value |
|------------|------------|----------------|
| peer_port | 0 | |
| is_original_dir | 1 | |
| enable | 1 | |
| live_connection | 1 | |
| selective_ack | 1 | |
| challenge_ack_passed | 0 | |
| last_direction | 0 | |
| liberal_mode | 0 | |
| state | RTE_FLOW_CONNTRACK_STATE_ESTABLISHED | |
| max_ack_window | 7 | max(S.tcp_options.wscale, SA.tcp_options.wscale) |
| retransmission_limit | 5 | Constant value |
| original_dir.scale | 7 | S.tcp_options.wscale |
| original_dir.close_initiated | 0 | |
| original_dir.last_ack_seen | 1 | |
| original_dir.data_unacked | 0 | |
| original_dir.sent_end | 2632987379 | A.seq |
| original_dir.reply_end | 2632987379 + 28960 | SA.ack + SA.window |
| original_dir.max_win | 510 << 7 | A.window << S.tcp_options.wscale |
| original_dir.max_ack | 2532480967 | A.ack |
| reply_dir.scale | 7 | SA.tcp_options.wscale |
| reply_dir.close_initiated | 0 | |
| reply_dir.last_ack_seen | 1 | |
| reply_dir.data_unacked | 0 | |
| reply_dir.sent_end | 2532480966 + 1 | SA.seq + 1 |
| reply_dir.reply_end | 2532480967 + (510 << 7) | A.ack + (A.window << S.tcp_options.wscale) |
| reply_dir.max_win | 28960 | SA.window |
| reply_dir.max_ack | 2632987379 | SA.ack |
| last_window | 510 | A.window |
| last_index | RTE_FLOW_CONNTRACK_INDEX_ACK | |
| last_seq | 2632987379 | A.seq |
| last_ack | 2532480967 | A.ack |
| last_end | 2632987379 | A.seq |

© Copyright 2024, NVIDIA. Last updated on Jan 9, 2025.