



Before using any DOCA Flow function, the application must call doca_flow_init(), which initializes all resources used by DOCA Flow.



This mode (mode_args) defines the basic traffic in DOCA. It creates some miss rules when the DOCA port is initialized. Currently, DOCA supports three modes:

vnf A packet arrives from one side of the application, is processed, and is sent from the other side. By default, a packet that misses goes to the RSS of all queues. The following diagram shows the basic traffic flow in vnf mode. Packet1 first misses to the host RSS queues. The app captures this packet, decides how to process it, and then creates a pipe entry. Packet2 hits this pipe entry and the entry's actions are applied (for VXLAN, for example, decap, modify, and encap) before the packet is sent out from P1.

switch Used for internal switching. Only representor ports are allowed (for example, uplink representors and SF/VF representors). A packet is forwarded from one port to another. If a packet arrives from an uplink and does not hit the rules defined by the user's pipe, it is received on all RSS queues of the uplink's representor. The following diagram shows the basic traffic flow in switch mode. Packet1 first misses to the host RSS queues. The app captures this packet, decides which representor it should go to, and then sets the rule. Subsequent packets hit this rule and go to representor0.

remote-vnf Remote mode is available on BlueField only, with two physical ports (uplinks). Users must use doca_flow_port_pair to pair one physical port with one of its representors. A packet from this uplink that does not hit any of the user's rules is first received on this representor. Users must also use doca_flow_port_pair to pair the two physical uplinks. If a packet is received from one uplink and hits a rule whose FWD action points to the other uplink, the packet is sent out from that uplink. The following diagram shows the basic traffic flow in remote-vnf mode. Packet1, from BlueField uplink P0, first misses to host VF0. The app captures this packet and decides whether to drop it or forward it to the other uplink (P1). The app then uses gRPC to set rules on P0. Packet2 hits the rule and is either dropped or sent out from P1.

DOCA Flow API serves as an abstraction layer API for network acceleration. The packet processing in-network function is described from ingress to egress and, therefore, a pipe must be attached to the origin port. Once a packet arrives to the ingress port, it starts the hardware execution as defined by the DOCA API.

doca_flow_port is an opaque object since the DOCA Flow API is not bound to a specific packet delivery API, such as DPDK. The first step is to start the DOCA Flow port by calling doca_flow_port_start() . The purpose of this step is to attach user application ports to the DOCA Flow ports.

When DPDK is used, the following configuration must be provided:

    enum doca_flow_port_type type = DOCA_FLOW_PORT_DPDK_BY_ID;
    const char *devargs = "1";

The devargs parameter points to a string that has the numeric value of the DPDK port_id in decimal format. The port must be configured and started before calling this API. Mapping the DPDK port to the DOCA port is required to synchronize application ports with hardware ports.
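As a minimal sketch of producing that string, the helper below formats a DPDK port ID as decimal text. The helper name format_port_devargs and the caller-supplied buffer are assumptions made for illustration; they are not part of DOCA or DPDK.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not a DOCA or DPDK API): writes the decimal form of
 * a DPDK port_id into buf. Returns 0 on success, -1 if buf is too small. */
static int format_port_devargs(uint16_t port_id, char *buf, size_t len)
{
    int n = snprintf(buf, len, "%u", (unsigned int)port_id);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resulting buffer can then be used wherever the devargs string is expected, provided it outlives the port configuration that points at it.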

A pipe is a template that defines packet processing without adding any specific HW rule. A pipe consists of the following elements:

Match

Monitor

Actions

Forward

The following diagram illustrates a pipe structure.

The creation phase allows the HW to efficiently build the execution pipe. After the pipe is created, specific entries can be added. Only a subset of the pipe can be used (e.g. skipping the monitor completely, just using the counter, etc).



Match is a mandatory field when creating a pipe. Using the following struct, users must define the fields that should be matched on the pipe.

For each doca_flow_match field, users choose whether the field is:

Ignored (wild card) – the value of the field is ignored.

Constant – all entries in the pipe must have the same value for this field. Users should not put a value for each entry.

Changeable – per entry, the user must provide the value to match. Note: L4 type, L3 type, and tunnel type cannot be changeable.

The match field type can be defined either implicitly or explicitly using the doca_flow_pipe_cfg.match_mask pointer. match_mask==NULL is implicit. Otherwise, it is explicit.



| Match Type | Pipe Value | Pipe Mask, match_mask | Entry Value |
| Wildcard (match any) | 0 | Null pointer | N/A |
| Constant | Pipe value | Null pointer | N/A |
| Variable (per entry) | Full mask (0xff...) | Null pointer | Per-entry value |

To match implicitly, the following should be taken into account.

Ignored fields – the field is zeroed and the pipeline performs no comparison on the field.

Constant fields – these are fields that have a constant value. For example, as shown in the following, the tunnel type is VXLAN.

    match.tun.type = DOCA_FLOW_TUN_VXLAN;

These fields only need to be configured once, not once per new pipeline entry.

Changeable fields – these are fields that may change per entry. For example, the following shows a field of the inner 5-tuple set with a full mask.

    match.in_dst_ip.ipv4_addr = 0xffffffff;

If the full-mask value is itself the constant value required by the user, then they should set the field to zero when adding a new entry.

Example

The following is an example of a match on the VXLAN tunnel, where for each entry there is a specific IPv4 destination address and an inner 5-tuple.

    static void build_underlay_overlay_match(struct doca_flow_match *match)
    {
        //outer
        match->out_dst_ip.ipv4_addr = 0xffffffff;
        match->out_l4_type = DOCA_PROTO_UDP;
        match->out_dst_port = DOCA_VXLAN_DEFAULT_PORT;
        match->tun.type = DOCA_FLOW_TUN_VXLAN;
        match->tun.vxlan_tun_id = 0xffffffff;
        //inner
        match->in_dst_ip.ipv4_addr = 0xffffffff;
        match->in_dst_ip.type = DOCA_FLOW_IP4_ADDR;
        match->in_src_ip.ipv4_addr = 0xffffffff;
        match->in_src_ip.type = DOCA_FLOW_IP4_ADDR;
        match->in_l4_type = DOCA_PROTO_TCP;
        match->in_src_port = 0xffff;
        match->in_dst_port = 0xffff;
    }
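The implicit rules above can be sketched as a small classifier over a single 32-bit field. This is only a model of the convention described in this section, assuming a 32-bit field; the enum and function names are hypothetical and do not exist in DOCA.

```c
#include <stdint.h>

/* Hypothetical classification of one 32-bit pipe match field when
 * match_mask == NULL (implicit mode). */
enum field_kind { FIELD_IGNORED, FIELD_CONSTANT, FIELD_CHANGEABLE };

static enum field_kind classify_implicit(uint32_t pipe_value)
{
    if (pipe_value == 0)
        return FIELD_IGNORED;      /* zeroed: no comparison on the field */
    if (pipe_value == UINT32_MAX)
        return FIELD_CHANGEABLE;   /* full mask: value supplied per entry */
    return FIELD_CONSTANT;         /* any other value: same for all entries */
}
```

For instance, classify_implicit(0xffffffff) yields FIELD_CHANGEABLE, which corresponds to the per-entry fields set in build_underlay_overlay_match().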

| Match Type | Pipe Value | Pipe Mask, match_mask | Entry Value |
| Wildcard (match any) | 0 | 0 | N/A |
| Constant | Pipe value | Full mask (0xff…) | N/A |
| Variable (per entry) | 0 | Mask | Per-entry value |

Users may provide a mask on a match. In this case, there are two doca_flow_match items: The first contains constant values and the second contains masks.

Ignored fields – the field is zeroed and the pipeline performs no comparison on the field.

    match_mask.in_dst_ip.ipv4_addr = 0;

Constant fields – these are fields that have a constant value. For example, as shown in the following, the tunnel type is VXLAN and the mask should be full.

    match.tun.type = DOCA_FLOW_TUN_VXLAN;
    match_mask.tun.type = 0xffffffff;

Once a field is defined as constant, the field's value cannot be changed per entry. Users must set constant fields to zero when adding entries so as to avoid ambiguity.

Changeable fields – these are fields that may change per entry (e.g. inner 5-tuple). Their value should be zero and the mask should be full.

    match.in_dst_ip.ipv4_addr = 0;
    match_mask.in_dst_ip.ipv4_addr = 0xffffffff;

Note that for IPs, the prefix mask can be used as well.
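To make the effect of a mask concrete, the following sketch builds an IPv4 prefix mask and compares a packet field against an expected value under that mask. It models the matching semantics only and assumes addresses in host byte order; both function names are hypothetical and are not DOCA APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Builds an IPv4 prefix mask in host byte order, e.g. 16 -> 0xffff0000.
 * Prefix lengths above 32 are clamped to 32. */
static uint32_t ipv4_prefix_mask(unsigned int prefix_len)
{
    if (prefix_len == 0)
        return 0;                       /* avoid an undefined 32-bit shift */
    if (prefix_len >= 32)
        return UINT32_MAX;
    return UINT32_MAX << (32 - prefix_len);
}

/* A field matches when packet and expected value agree on every masked bit.
 * A zero mask therefore ignores the field entirely. */
static bool masked_match(uint32_t packet, uint32_t value, uint32_t mask)
{
    return ((packet ^ value) & mask) == 0;
}
```

With a /16 mask, 10.7.9.2 matches the value 10.7.0.0 while 10.1.9.2 does not, and a zero mask matches any packet value.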

Similarly to setting pipe match, actions also have a template definition.

Similarly to doca_flow_match in the creation phase, only the subset of actions that should be executed per packet is defined. This is done in a similar way to match, namely by classifying each field of doca_flow_actions as one of the following:

Ignored field – field is zeroed, modify is not used

Constant fields – when a field must be modified per packet, but the value is the same for all packets, a one-time value on action definitions can be used

Changeable fields – fields that may have more than one possible value, and the exact values are set by the user per entry

    match_mask.in_dst_ip.ipv4_addr = 0xffffffff;

Metadata is considered a per-packet changeable field; the pipe action is used as a mask.

Boolean fields – Boolean values, encap and decap are considered as constant values. It is not allowed to generate actions with encap=true and to then have an entry without an encap value.

For example:

    static void create_decap_inner_modify_actions(struct doca_flow_actions *actions)
    {
        actions->decap = true;
        actions->mod_dst_ip.ipv4_addr = 0xffffffff;
    }

It is possible to force constant modification or per-entry modification with action description type ( CONSTANT or SET ) and mask. For example:

    static void create_constant_modify_actions(struct doca_flow_actions *actions,
                                               struct doca_flow_action_descs *descs)
    {
        actions->mod_src_port = 0x1234;
        descs->src_port.type = DOCA_FLOW_ACTION_CONSTANT;
        descs->outer.src_port.mask.u64 = 0xffff;
    }

Action description can be used to copy between packet field and metadata. For example:

    static void create_copy_packet_to_meta_actions(struct doca_flow_match *match,
                                                   struct doca_flow_action_descs *descs)
    {
        descs->src_ip.type = DOCA_FLOW_ACTION_COPY;
        descs->src_ip.copy.dst = &match->meta.u32[1];
    }

Creating a pipe is possible using a list of multiple actions. For example:

    static void create_multi_actions_for_pipe_cfg()
    {
        struct doca_flow_actions *actions_arr[2];
        struct doca_flow_actions actions_0 = {0}, actions_1 = {0};
        struct doca_flow_pipe_cfg pipe_cfg = {0};

        /* input configurations for actions_0 and actions_1 */

        actions_arr[0] = &actions_0;
        actions_arr[1] = &actions_1;
        pipe_cfg.attr.nb_actions = 2;
        pipe_cfg.actions = actions_arr;
    }

| Pipe Creation: action_desc, doca_flow_action_type | Pipe Creation: action_desc, Configuration | Pipe Creation: Pipe Actions | Entry Creation: Entry Actions |
| DOCA_FLOW_ACTION_AUTO: derived from pipe actions | No specific config | 0 – field ignored, no modification | N/A |
| | | val != 0 – apply this val to all entries | N/A |
| | | val = 0xfff – changeable field | Define val per entry |
| | | Specific for Metadata - the meta field in the actions is used as a mask | Define val per entry |
| DOCA_FLOW_ACTION_CONSTANT: pipe action is constant | Define the mask | Define val to apply for all entries | N/A |
| DOCA_FLOW_ACTION_SET: set value from entry action | Define the mask | N/A | Define val per entry |
| DOCA_FLOW_ACTION_ADD: add field value | Define the dst field and width | N/A | Define val per entry |
| DOCA_FLOW_ACTION_COPY: copy field to another field | Define the source and destination fields: meta field → header field, header field → meta field, meta field → meta field | N/A | N/A |

Field Match Modification Add Copy meta.pkt_meta x x x meta.u32 x x x Packet outer fields x (field list) x (field list) TTL Between meta[1] Packet tunnel x To meta Packet inner fields x (field list) To meta[1]

[1] Copy from meta to IP is not supported.

If a meter policer should be used, then it is possible to have the same configuration for all policers on the pipe or to have a specific configuration per entry. The meter policer is determined by the FWD action. If an entry has NULL FWD action, the policer FWD action is taken from the pipe.

The monitor also includes the aging configuration. If the aging time is set, the entry ages out when the timeout passes without any packet matching the entry. User data is used to map user usage. If the user_data field is set, the query API returns this user_data when the entry ages out. If user_data is not configured by the application, the aged pipe entry handle is returned.

For example:

    static void build_entry_monitor(struct doca_flow_monitor *monitor, void *user_ctx)
    {
        monitor->flags |= DOCA_FLOW_MONITOR_AGING;
        monitor->aging = 10;
        monitor->user_data = (uint64_t)user_ctx;
    }

Refer to Pipe Entry Aged Query for more information.

The FWD (forwarding) action is the last action in a pipe, and it directs where the packet goes next. Users may configure one of the following destinations:

Send to software (representor)

Send to wire

Jump to next pipe

Drop packets

The FWD action may be set at pipe creation, but it can also be unique per entry.

A pipe can be defined with constant forwarding (e.g., always send packets on a specific port). In this case, all entries will have the exact same forwarding. If forwarding is not defined when a pipe is created, users must define forwarding per entry. In this instance, entries in the pipe may have different forwarding actions.
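The choice between constant and per-entry forwarding can be summarized with a small model. The types and the function below are hypothetical stand-ins, not DOCA structures, and the model deliberately ignores the changeable-forwarding case.

```c
#include <stddef.h>

enum toy_fwd_type { TOY_FWD_PORT, TOY_FWD_RSS, TOY_FWD_PIPE, TOY_FWD_DROP };

struct toy_fwd {
    enum toy_fwd_type type;
};

/* Returns the forwarding that applies to an entry: the pipe's constant
 * forwarding when one was given at creation, otherwise the entry's own.
 * NULL means neither was defined, which this model treats as an error. */
static const struct toy_fwd *toy_effective_fwd(const struct toy_fwd *pipe_fwd,
                                               const struct toy_fwd *entry_fwd)
{
    return pipe_fwd != NULL ? pipe_fwd : entry_fwd;
}
```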

When a pipe includes meter monitor <cir, cbs> , it must have fwd defined as well as the policer.

The following is an RSS forwarding example:

    fwd->type = DOCA_FLOW_FWD_RSS;
    fwd->rss_queues = queues;
    fwd->rss_flags = DOCA_FLOW_RSS_IP | DOCA_FLOW_RSS_UDP;
    fwd->num_of_queues = 4;

Queues point to the uint16_t array that contains the queue numbers. When a port is started, the number of queues is defined, starting from zero up to the number of queues minus 1. RSS queue numbers may contain any subset of those predefined queue numbers. For a specific match, a packet may be directed to a single queue by having RSS forwarding with a single queue.
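Because every RSS queue number must fall within the range defined when the port was started, an application may want to validate its queue array before building the forward. The check below is a sketch with a hypothetical name; it is not a DOCA API, and the port's queue count is assumed to be tracked by the application.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pre-check (not a DOCA API): every RSS queue index must be
 * below the number of queues the port was started with. */
static bool rss_queues_valid(const uint16_t *queues, int num_of_queues,
                             uint16_t port_nb_queues)
{
    if (queues == NULL || num_of_queues <= 0)
        return false;
    for (int i = 0; i < num_of_queues; i++) {
        if (queues[i] >= port_nb_queues)
            return false;
    }
    return true;
}
```

A single-element array passes the check as well, which corresponds to directing a specific match to one queue.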

Changeable RSS forwarding is supported. When creating the pipe, the num_of_queues must be set to 0xffffffff, then different forwarding RSS information can be set when adding each entry.

    fwd->num_of_queues = 0xffffffff;

MARK is an optional parameter that may be communicated to the software. If MARK is set and the packet arrives to the software, the value can be examined using the software API. When DPDK is used, MARK is placed on the struct rte_mbuf . (See "Action: MARK" section in official DPDK documentation.) When using the Kernel, the MARK value is placed on the struct sk_buff MARK field.

The port_id is given in struct doca_flow_port_cfg .

The packet is directed to the port. In many instances the complete pipe is executed in the HW, including the forwarding of the packet back to the wire. The packet never arrives to the SW.

Example code for forwarding to port:

    struct doca_flow_fwd *fwd = malloc(sizeof(struct doca_flow_fwd));
    memset(fwd, 0, sizeof(struct doca_flow_fwd));
    fwd->type = DOCA_FLOW_FWD_PORT;
    fwd->port_id = port_cfg->port_id;

The type of forwarding is DOCA_FLOW_FWD_PORT and the only data required is the port_id as defined in DOCA_FLOW_PORT .

Changeable port forwarding is also supported. When creating the pipe, the port_id must be set to 0xff , then different forwarding port_id values can be set when adding each entry.

Once all parameters are defined, the user should call doca_flow_pipe_create to create a pipe.

The return value of the function is a handle to the pipe. This handle should be given when adding entries to pipe. If a failure occurs, the function returns NULL , and the error reason and message are put in the error argument if provided by the user.

Refer to the NVIDIA DOCA Libraries API Reference Manual to see which fields are optional and may be skipped. It is typically recommended to set optional fields to 0 when not in use. See Miss Pipe and Control Pipe for more information.

Once a pipe is created, a new entry can be added to it. These entries are bound to a pipe, so when a pipe is destroyed, all the entries in the pipe are removed. Please refer to section Pipe Entry for more information.

There is no priority between pipes or entries. Priority can be implemented by matching the highest priority first and, if a miss occurs, jumping to the next pipe. There can be more than one pipe on a root as long as the pipes do not overlap. If entries overlap, the priority is set according to the order in which the entries are added. So, if two root pipes have overlapping matches and PIPE1 has higher priority than PIPE2, users should add an entry to PIPE1 after all entries are added to PIPE2.

An entry is a specific instance inside of a pipe. When defining a pipe, users define match criteria (subset of fields to be matched), the type of actions to be done on matched packets, monitor, and, optionally, the FWD action.

When a user calls doca_flow_pipe_add_entry() to add an entry, they should define the values that are not constant among all entries in the pipe. If FWD was not defined on the pipe, it must also be defined for the entry.

DOCA Flow is designed to support concurrency in an efficient way. Since the expected rate is going to be in millions of new entries per second, it is mandatory to use a similar architecture as the data path. Having a unique queue ID per core saves the DOCA engine from having to lock the data structure and enables the usage of multiple queues when interacting with HW.

Each core is expected to use its own dedicated pipe_queue number when calling doca_flow_pipe_entry . Using the same pipe_queue from different cores causes a race condition and has unexpected results.

Upon success, a handle is returned. If a failure occurs, a NULL value is returned, and an error message is filled. The application can keep this handle and call remove on the entry using its handle.

    int doca_flow_pipe_rm_entry(uint16_t pipe_queue, void *usr_ctx, struct doca_flow_pipe_entry *entry);

By default, no counter is added. If defined in monitor, a unique counter is added per entry.

Note: Having a counter per entry affects performance and should be avoided if it is not required by the application.

When a counter is present, it is possible to query the flow and get the counter's data by calling doca_flow_query .

The retrieved statistics are stored in struct doca_flow_query.

When a user calls doca_flow_aged_query() , this query is used to get the aged-out entries by the time quota in microseconds. The entry handle or the user_data input is returned by this API.

Since the number of flows can be very large, the query of aged flows is limited by a quota in microseconds. This means that it may return without all flows and requires the user to call it again. When the query has gone over all flows, a full cycle is done.

The struct doca_flow_aged_query contains the element user_data which contains the aged-out flow contexts.
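The quota-limited scan can be illustrated with a toy model. Everything below is hypothetical: the structures are not DOCA types, and a per-call budget of entries stands in for the microsecond quota so that the example stays deterministic. The caller must provide an aged array with room for at least budget indices.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of an offloaded entry; all names here are hypothetical. */
struct toy_entry {
    uint32_t last_hit;   /* time of the last matching packet */
    uint32_t aging;      /* timeout in the same units; 0 disables aging */
};

struct toy_scan {
    size_t cursor;       /* where the next call resumes */
};

/* Scans at most `budget` entries (a stand-in for the time quota) and stores
 * the indices of aged-out entries in `aged`. Sets *full_cycle when the scan
 * has passed the last entry. Returns the number of aged entries found. */
static size_t toy_aged_query(struct toy_scan *scan,
                             const struct toy_entry *entries, size_t nb_entries,
                             uint32_t now, size_t budget,
                             size_t *aged, bool *full_cycle)
{
    size_t found = 0;

    *full_cycle = false;
    while (budget-- > 0 && scan->cursor < nb_entries) {
        const struct toy_entry *e = &entries[scan->cursor];

        if (e->aging != 0 && now - e->last_hit >= e->aging)
            aged[found++] = scan->cursor;
        scan->cursor++;
    }
    if (scan->cursor >= nb_entries) {
        scan->cursor = 0;    /* next call starts a new cycle */
        *full_cycle = true;
    }
    return found;
}
```

A caller would keep invoking the query until full_cycle is reported, handling the aged entries returned by each call in between.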

Users can define multiple actions per pipe. This gives the user the option to define different actions per entry in the same pipe by providing the action_idx in struct doca_flow_actions .

For example, to create two flows with the same match but with different actions, users can provide two actions upon pipe creation, Action_0 and Action_1 , which have indices 0 and 1 respectively in the actions array in the pipe configuration. Action_0 has modify_mac , and Action_1 has modify_ip .

Users can also add two kinds of entries to the pipe, the first one with Action_0 and the second with Action_1 . This is done by assigning 0 in the action_idx field in struct doca_flow_actions when creating the first entry and 1 when creating the second one.

Note: Only one root pipe is allowed. If more than one is needed, create a control pipe as root and forward the packets to relevant non-root pipes.

To set priority between pipes, users must use miss pipes. Miss pipes make it possible to look up entries associated with pipe X and, if there are no matches, to jump to pipe X+1 and perform a lookup on entries associated with pipe X+1.

The following figure illustrates the HW table structure:

The first lookup is performed on the table with priority 0. If no hits are found, then it jumps to the next table and performs another lookup.
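The lookup order described above can be modeled in software as a chain of tables linked by their miss targets. This is a toy model with hypothetical names and exact-match keys only; it does not reflect how the hardware tables are implemented.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ENTRIES 4

/* Toy pipe: exact-match keys plus an optional miss target. */
struct toy_pipe {
    uint32_t keys[TOY_MAX_ENTRIES];
    size_t nb_keys;
    const struct toy_pipe *miss_next;  /* NULL: no next pipe on a miss */
};

/* Returns the pipe whose entries hit `key`, walking the miss chain in
 * priority order, or NULL when every pipe in the chain misses. */
static const struct toy_pipe *toy_lookup(const struct toy_pipe *pipe, uint32_t key)
{
    for (; pipe != NULL; pipe = pipe->miss_next) {
        for (size_t i = 0; i < pipe->nb_keys; i++) {
            if (pipe->keys[i] == key)
                return pipe;
        }
    }
    return NULL;
}
```

When the same key is present in two pipes of the chain, the earlier pipe wins, which is how the chain expresses priority.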

The way to implement a miss pipe in DOCA Flow is to use a miss pipe in FWD. When a pipe is created with fwd_miss configured, a packet that does not match the pipe is steered to the pipe given in the next_pipe field of fwd_miss, which is a struct doca_flow_fwd.

Note: fwd_miss is of type struct doca_flow_fwd but it only implements two forward types of this struct:

DOCA_FLOW_FWD_PIPE – forwards the packet to another pipe

DOCA_FLOW_FWD_DROP – drops the packet

Other forwarding types (e.g., forwarding to port or sending to RSS queue) are not supported.

next_pipe is defined as doca_flow_pipe and created by doca_flow_pipe_create . To separate miss_pipe and a general one, is_root is introduced in struct doca_flow_pipe_cfg . If is_root is true, it means the pipe is a root pipe executed on packet arrival. Otherwise, the pipe is next_pipe .

When fwd_miss is not null, the packet that does not match the criteria is handled by next_pipe which is defined in fwd_miss .

In internal implementations of doca_flow_pipe_create , if fwd_miss is not null and the forwarding action type of miss_pipe is DOCA_FLOW_FWD_PIPE , a flow with the lowest priority is created that always jumps to the group for the next_pipe of the fwd_miss . Then the flow of next_pipe can handle the packets, or drop the packets if the forwarding action type of miss_pipe is DOCA_FLOW_FWD_DROP .

For example, VXLAN packets are forwarded as RSS and other packets are hairpinned. The miss_pipe is for the other packets (non-VXLAN packets) and its match is for general Ethernet packets. The fwd_miss is defined by miss_pipe and the type is DOCA_FLOW_FWD_PIPE. The VXLAN pipe is created by doca_flow_pipe_create() with fwd_miss set.

Since, in the example, the jump flow is for general Ethernet packets, it is possible that some VXLAN packets match it and cause conflicts. For example, VXLAN flow entry for ipA is created. A VXLAN packet with ipB comes in, no flow entry is added for ipB , so it hits miss_pipe and is hairpinned.

A control pipe is introduced to handle the conflict. When a user calls doca_flow_create_control_pipe() , the new control pipe is created without any configuration except for the port. Then the user can add different matches with different forwarding and priorities when there are conflicts.

The user can add a control entry by calling doca_flow_control_pipe_add_entry() .

priority must be defined as higher than the lowest priority (3) and lower than the highest one (0).

The other parameters have the same meaning as the parameters in doca_flow_pipe_create. In the example above, a control entry for VXLAN is created. The VXLAN packets with ipB hit the control entry.

doca_flow_pipe_lpm uses longest prefix match (LPM) matching. LPM matching is limited to a single field of the doca_flow_match (e.g., the outer destination IP). Each entry consists of a value and a mask (e.g., 10.0.0.0/8, 10.10.0.0/16, etc). The LPM match is defined as the entry that has the maximum number of matching bits. For example, using the two entries 10.7.0.0/16 and 10.0.0.0/8, the IP 10.1.9.2 matches on 10.0.0.0/8 and the IP 10.7.9.2 matches on 10.7.0.0/16 because 16 bits match.
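The selection rule can be illustrated with a linear-scan reference lookup in plain C. This sketch only demonstrates the LPM semantics on IPv4 addresses in host byte order; the names are hypothetical, and it says nothing about how doca_flow_pipe_lpm is implemented.

```c
#include <stddef.h>
#include <stdint.h>

struct lpm_entry {
    uint32_t prefix;         /* IPv4 prefix in host byte order */
    unsigned int prefix_len; /* valid range: 0..32 */
};

/* Mask with the top prefix_len bits set; prefix_len must be 0..32. */
static uint32_t lpm_mask(unsigned int prefix_len)
{
    return prefix_len == 0 ? 0 : UINT32_MAX << (32 - prefix_len);
}

/* Linear-scan reference lookup: returns the index of the matching entry
 * with the longest prefix, or -1 if no entry matches. */
static int lpm_lookup(const struct lpm_entry *entries, size_t nb_entries,
                      uint32_t ip)
{
    int best = -1;

    for (size_t i = 0; i < nb_entries; i++) {
        uint32_t mask = lpm_mask(entries[i].prefix_len);

        if ((ip & mask) != (entries[i].prefix & mask))
            continue;
        if (best < 0 || entries[i].prefix_len > entries[(size_t)best].prefix_len)
            best = (int)i;
    }
    return best;
}
```

With a table holding 10.7.0.0/16 at index 0 and 10.0.0.0/8 at index 1, looking up 10.1.9.2 selects index 1 and looking up 10.7.9.2 selects index 0, as in the example above.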

The monitor, actions, and FWD of the DOCA Flow LPM pipe work the same as those of the basic DOCA Flow pipe.

The maximum insertion latency of doca_flow_pipe_lpm can reach milliseconds in some cases and, therefore, it is better to insert entries from the control path. To get the best insertion performance, entries should be added in large batches.

Note: An LPM pipe cannot be a root pipe. You must create a pipe as root and forward the packets to the LPM pipe.

doca_flow_pipe_ordered_list allows the user to define a specific order of actions and to use multiple actions of the same type (e.g., specific ordering between counter/meter and encap/decap).

An ordered list pipe is defined by an array of action arrays (i.e., sequences of actions). Each entry can be an instance of one of these sequences. An ordered list pipe may consist of up to 8 different action arrays. The maximum size of each action array is 4 elements. Resource allocation may be optimized when combining multiple action arrays in one ordered list pipe.

Users can enable hardware steering mode by setting devarg dv_flow_en to 2 .

The following is an example of running DOCA with hardware steering mode:

    .... -a 03:00.0,dv_flow_en=2 -a 03:00.1,dv_flow_en=2 ....

The following is an example of running DOCA with software steering mode:

    .... -a 03:00.0 -a 03:00.1 ....

The dv_flow_en=2 means that hardware steering mode is enabled.

In the struct doca_flow_cfg , the member mode_args represents DOCA applications. If it is defined with hws (e.g., "vnf,hws" , "switch,hws" , "remote_vnf,hws" ) then hardware steering mode is enabled.

When creating an entry by calling doca_flow_pipe_add_entry, the parameter flags can be set to DOCA_FLOW_WAIT_FOR_BATCH or DOCA_FLOW_NO_WAIT. DOCA_FLOW_WAIT_FOR_BATCH means that this flow entry waits to be pushed to hardware. Batched flows can then be pushed all at once, which reduces the number of pushes and improves the insertion rate. DOCA_FLOW_NO_WAIT means that the flow entry is pushed to hardware immediately.

The parameter usr_ctx is handled in the callback defined in struct doca_flow_cfg .

doca_flow_entries_process processes all the flows in this queue. After the flow is handled and the status is returned, the callback is executed with the status and usr_ctx .

If the user does not define the callback in doca_flow_cfg , the user can get the status using doca_flow_entry_get_status to check if the flow has completed offloading or not.

In non-isolated mode (default), any received packets (e.g., following an RSS forward) can be processed by the DOCA application, bypassing the kernel. In the same way, the DOCA application can send packets to the NIC without kernel knowledge. This is why, by default, no replies are received when pinging a host with a running DOCA application. If only specific packet types (e.g., DNS packets) should be processed by the DOCA application, while other packets (e.g., ICMP ping) should be handled directly by the kernel, then isolated mode becomes useful.

In isolated mode, packets that match root pipe entries are steered to the DOCA application (as usual) while other packets are received/sent directly by the kernel.

To activate isolated mode, in the struct doca_flow_cfg , the member mode_args represents DOCA applications. If it is defined with isolated (i.e., "vnf,hws,isolated" , "switch,isolated" ) then isolated mode is enabled.

If you plan to create a pipe with matches followed by action/monitor/forward operations, it is advised, for functional and performance reasons, that root pipe entries include the matches followed by a forward to a next pipe. All the planned match/action/monitor/forward operations can then be specified in the next pipe. Unmatched packets are received and sent by the kernel.

When an entry is terminated by the user application or ages out, the user should call the entry destroy function, doca_flow_pipe_rm_entry(). This frees the pipe entry and cancels hardware offload.

When a pipe is terminated by the user application, the user should call the pipe destroy function, doca_flow_destroy_pipe(). This destroys the pipe and the pipe entries that belong to it.

When all pipes of a port are terminated by the user application, the user should call the pipe flush function, doca_flow_port_pipe_flush() . This destroys all pipes and all pipe entries belonging to this port.

When the port is not used anymore, the user should call the port destroy function, doca_flow_destroy_port() . This destroys the DOCA port and frees all resources of the port.