DOCA Documentation v3.2.0

DPL Program Metrics

The DPL compiler produces metrics information for compiled DPL programs to provide developers with insight into how their programs are implemented in the target hardware architecture. This information is critical for identifying potential performance bottlenecks, such as expensive P4 tables, complex P4 actions, or critical processing paths through a complete program.

This section describes the visibility the DPL compiler currently provides and how to use it to guide DPL program development toward the required performance.

By default, the DPL compiler writes metrics information to the p4-build directory alongside its other standard outputs.

Enable/Disable Metrics

  • Default Behavior: Metrics generation is enabled by default, which is equivalent to passing the compilation flag --enable metrics.

  • Disable Metrics: Generation can be turned off using the flag --disable metrics.

Generated Files

  • program_name.metrics.log: A text-based log summary of the compiled program's metrics.

  • program_name.metrics.json: A JSON-structured representation, useful for tracking metrics programmatically over time.
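
Since the JSON file is meant for tracking metrics programmatically, one possible use is to compare the output of two builds. The following is a minimal sketch, not part of the DPL tooling. It assumes nothing about the JSON schema, which is not described here: it flattens every numeric leaf to a dotted path and reports the values that differ. The function names are illustrative.

```python
import json
from pathlib import Path


def flatten_numbers(node, prefix=""):
    """Map every numeric leaf of a parsed JSON document to a dotted path."""
    flat = {}
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten_numbers(value, child))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            flat.update(flatten_numbers(value, f"{prefix}[{index}]"))
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        flat[prefix] = node
    return flat


def diff_metrics(old, new):
    """Return {path: (old_value, new_value)} for numeric leaves that differ.

    A value of None means the path is absent from that document.
    """
    old_flat, new_flat = flatten_numbers(old), flatten_numbers(new)
    changed = {}
    for path in sorted(old_flat.keys() | new_flat.keys()):
        before, after = old_flat.get(path), new_flat.get(path)
        if before != after:
            changed[path] = (before, after)
    return changed


def diff_metrics_files(old_path, new_path):
    """Load two metrics JSON files and diff their numeric leaves."""
    old = json.loads(Path(old_path).read_text())
    new = json.loads(Path(new_path).read_text())
    return diff_metrics(old, new)
```

For example, diff_metrics_files could be called on a saved copy of program_name.metrics.json from an earlier build and on the copy in the current p4-build directory. Where the earlier copy is stored is up to the developer.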

The metrics log is based on a "lowered" representation of the DPL program, mapping high-level P4 constructs to hardware-aligned blocks called matchers and rules.

  • Matcher: Generally correlates to a P4 Table.

  • Rule: Generally correlates to a P4 Action.

  • STE (Steering Table Entry): The fundamental hardware unit of work, consisting of a match phase and an action phase.

The log file contains two main sections:

DOCA HWS Metrics Report

This section summarizes resource usage and associated costs for matchers and rules. It begins with a summary of the total estimated memory usage and total unique STEs.

Column definitions:

| Column | Description |
|---|---|
| ID | A unique identifier generated by the compiler, corresponding to ID values found in the program_name.program.json file. |
| Table Name | The P4 table from which the matcher was derived (blank for synthesized matchers). |
| Domain | The hardware domain the matcher resides in. Currently allowed domains: FDB, FDB_RX, and FDB_TX. |
| Table Size | The capacity of rules (entries) the table can hold. This is expected to be a power of two due to memory allocation constraints. |
| Action Name | The P4 action from which the rule was derived. |
| Memory | Matcher: Bytes allocated to implement the matcher (enough for worst-case rules). Rule: Bytes required for a single rule of that type. |
| Packet Processing | A unitless estimate of execution cost. Higher numbers indicate more processing time. Matcher: Cost of processing a table miss (search + branching). Rule: Cost to execute the rule. |
| Insertion | A unitless estimate of the cost to insert a rule into a given matcher. Complex rules with many STEs take longer to insert. |
| Match STEs | The number of STEs required to perform the match operation (typically 1 or 2 for BlueField targets). |
| Action STEs | The number of STEs required to perform all of a rule's actions. A value of 0 indicates actions are performed in the same STE as the match. |
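
The Table Size column is expected to hold a power of two. The sketch below shows one way to check a planned table size and find the next power of two at or above it. It is a planning aid only: the helper names are illustrative, and whether the compiler rounds a requested size up or handles it some other way is not stated here.

```python
def is_power_of_two(n: int) -> bool:
    """True when n is a positive power of two (1, 2, 4, 8, ...)."""
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is greater than or equal to n."""
    if n < 1:
        raise ValueError("table size must be at least 1")
    # (n - 1).bit_length() is the number of bits needed to represent n - 1,
    # so shifting 1 by that amount gives the first power of two >= n.
    return 1 << (n - 1).bit_length()
```

A table sized for 1000 entries, for instance, would fit in a capacity of next_power_of_two(1000), which is 1024.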


DOCA HWS Metrics Report

This section summarizes the worst-case execution paths through the DPL program based on two different criteria:

  1. Steps: The path requiring the highest number of STE steps.

  2. Cost: The path with the highest cumulative packet processing cost.

Column definitions:

| Column | Description |
|---|---|
| Step | The step order of the execution path, at the granularity of a hardware rule. |
| Table | The matcher ID visited at this step (includes P4 table name when possible). |
| Entry | The rule ID visited at this step (includes P4 action name when possible). |
| STEs | The number of STEs executed to perform the rule. |
| Pkt Processing | A unitless estimate of the cost of exercising the rule at this specific step. |
| Total | The sum of STE count and Packet Processing cost for the entire execution path. |
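
The two criteria can pick different paths, which the following sketch illustrates. It is not the compiler's algorithm: it simply totals the STEs and Pkt Processing values of each step on a path and selects the maximum under each criterion. The PathStep type, the function names, and all numbers in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathStep:
    """One row of a path report (field names are illustrative)."""
    table: str             # matcher visited at this step
    entry: str             # rule visited at this step
    stes: int              # STEs executed to perform the rule
    pkt_processing: float  # unitless cost estimate for this step


def path_totals(path):
    """Return (total STEs, total packet processing cost) for one path."""
    return (
        sum(step.stes for step in path),
        sum(step.pkt_processing for step in path),
    )


def worst_paths(paths):
    """Given {name: [PathStep, ...]}, return the names of the worst path
    by STE count and the worst path by packet processing cost."""
    by_steps = max(paths, key=lambda name: path_totals(paths[name])[0])
    by_cost = max(paths, key=lambda name: path_totals(paths[name])[1])
    return by_steps, by_cost
```

With made-up data, a path of three steps using 1, 2, and 2 STEs at costs 10, 5, and 4 totals (5, 19), while a path of two steps using 1 and 2 STEs at costs 10 and 40 totals (3, 50). The first is the worst case by steps and the second is the worst case by cost.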


The metrics output serves as a guide for optimizing DPL programs. Consider the following principles:

  1. Minimize STE Execution: The total number of STEs executed has a direct adverse effect on packets per second (PPS). Fewer STEs result in faster processing times per packet.

  2. Simplify Rules: Rule complexity impacts both memory footprint and insertion cost. Smaller, simpler rules execute faster and are faster to insert from the control plane.

  3. Focus on Critical Paths: Complex tables or actions usually only impact performance when located on a critical program execution path. Optimizing rarely used paths (e.g., error handling) may not yield observable performance improvements.

Keep the following limitations in mind when interpreting the metrics:

  • Estimates Only: The values reported in these metrics are estimates.

  • Runtime Variables: The compiler cannot predict system caching performance or the specific characteristics of rules inserted at runtime.

  • Memory Impact: Because fetch/retrieval times from system memory contain unknowns, actual packet processing costs may vary from these static estimates.

© Copyright 2025, NVIDIA. Last updated on Nov 20, 2025